Business Context and Project Scenario

The Bank of England prudentially regulates and supervises financial services firms through the Prudential Regulation Authority (PRA). The PRA is responsible for the prudential regulation and supervision of around 1,500 banks, building societies, credit unions, insurers, and major investment firms. To achieve this, the PRA examines various data sources, some more accessible than others. Quarterly results announcements are a particularly challenging data source to analyse. Traditional data science methods struggle to fully utilise this type of data for two reasons:

  • They are unstructured in the form of text and/or video/webcasts.
  • They are complex, requiring technical and financial background knowledge.

This project aims to enhance the use of these data sets to improve our risk assessment of individual firms and, in doing so, maintain financial stability.

The Bank of England’s RegTech, Data and Innovation team (the 'Team') would like to understand whether the quarterly results announcements provide additional information or insights on a firm. This includes, but is not limited to, the following:

  1. Topic modelling and sentiment analysis: Using a mix of text pre-processing and pre-trained language models (e.g. FinBERT and frameworks like BERTopic), is it possible to cluster the key topics raised by industry analysts and the sentiments related to those topics during Q&A sessions with the senior management team (e.g. in the earnings call Q&A transcripts)?
  2. Information summarisation: Using a mix of text pre-processing and pre-trained language models, can key takeaways raised in these transcripts be extracted and summarised? Some text pre-processing/intermediary language model pipelines would probably be needed before generating the summaries themselves so that the summaries are grouped in a manner that makes sense. Example groups could be:
  • By topic; for example, your choice of specific issues raised by analysts (two or three topics would do, although you are welcome to cover more ground)
  • By specific metrics; for example, a summary of all instances related to metric X, a summary of all instances related to metric Y, etc. (two or three metrics would do, although you are welcome to cover more metrics)
  • By speaker; for example, analyst or banker.

The team is also keen to explore additional methods for extracting value from these data sources. This includes new technical approaches using GenAI/language models, as well as innovative ways to analyse and compare data, such as benchmarking a firm against its peers.

As the PRA regulates many institutions, the Team proposes focusing on one or two banks which have been identified in the list of global systemically important banks (G-SIBs).

For this project, call transcripts from JPMorgan Chase were selected for initial development and assessment of methodologies.

Setup: Libraries & Packages

In [ ]:
%pip install bertopic
%pip install prettytable
%pip install scikit-learn
%pip install sentence-transformers
%pip install spacy
%pip install venn
%pip install PyMuPDF
%pip install -U -q PyDrive
!python -m spacy download en_core_web_sm

# standard library imports
import csv
import json
import math
import os
import re
import string
import time
import zipfile

from collections import Counter

# third-party imports
import fitz  # PyMuPDF
import matplotlib.colors as mcolors
import nltk
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio
import seaborn as sns
import spacy
import tensorflow_hub as hub
import torch


from bertopic import BERTopic
from dateutil import parser
from hdbscan import HDBSCAN
from ipywidgets import Button, HBox, Output
from google.colab import drive, files, userdata
from matplotlib import pyplot as plt
from matplotlib_venn import venn2, venn3
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from prettytable import PrettyTable
from scipy.stats import gaussian_kde, linregress, mannwhitneyu, pearsonr
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, mean_squared_error, ConfusionMatrixDisplay
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
from umap import UMAP
from wordcloud import WordCloud
from IPython.display import display, HTML

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

Raw Data folder: contains the raw data and financial metrics data

Processed Data folder: contains evaluation data sets, cleaned and tabularised data, and Phi-3.5 summarised data

Output Data folder: contains files created by various models for analysis

In [ ]:
# Get the base data folder: one level up from the notebooks folder, then into "data"
base_folder = os.path.abspath(os.path.join(os.getcwd(), "..", "data"))

# Define the path for each subfolder
raw_data_folder = os.path.join(base_folder, "Raw_Data")
processed_data_folder = os.path.join(base_folder, "Processed_Data")
output_data_folder = os.path.join(base_folder, "Output_Data")


# Create the directories if they do not exist
for folder in [raw_data_folder, processed_data_folder, output_data_folder]:
    os.makedirs(folder, exist_ok=True)

0 Exploration of Financial Metrics Data

In [ ]:
metrics_path = os.path.join(raw_data_folder, 'key_financial_metrics_JPMorgan_clean.xlsx')
metrics_df = pd.read_excel(metrics_path)

metrics_df['date'] = pd.to_datetime(metrics_df['date'])
In [ ]:
# Getting the unique metric types
unique_metric_types = metrics_df['metric_type'].unique()

# Defining colors for each plot
colors = ['blue', 'green', 'red', 'purple']

# Plotting in a 2x2 grid
fig, axes = plt.subplots(2, 2, figsize=(18, 9))

# Flattening axes array for easy iteration
axes = axes.flatten()

# Looping through metric types and corresponding axes
for i, metric in enumerate(unique_metric_types):
    subset = metrics_df[metrics_df['metric_type'] == metric]
    axes[i].plot(subset['Q&FY'], subset['metric_value'], marker='o', color=colors[i])
    axes[i].set_title(f'Metric: {metric}', fontsize=16, color=colors[i])
    axes[i].set_xlabel('Quarter', fontsize=14)
    axes[i].set_ylabel('Metric Value', fontsize=14)
    axes[i].tick_params(axis='x', labelsize=12, rotation=45)
    axes[i].tick_params(axis='y', labelsize=12)
    axes[i].grid(visible=True, linestyle='--', alpha=0.5)

# Adjusting the layout
plt.tight_layout()

# Showing the plots
plt.show()
[Figure: 2x2 grid of line charts, one per financial metric type, plotted by quarter]

Picking out two interesting dates to explore, based on the charts of key financial metrics above:

1Q22

In 1Q22, the bank faced significant financial challenges, marked by a low CET1 capital ratio and declining net income and EPS. This suggests reduced profitability and shareholder returns. The increase in provisions for credit losses suggests heightened caution regarding potential loan defaults. Overall, this quarter indicates a period of financial strain and risk management.

2Q24

In 2Q24, the bank reached peak performance across several key metrics, with its CET1 capital ratio, net income, and EPS all at their highest levels. This suggests a strong capital position, robust profitability, and high returns to shareholders. Provisions for credit losses, however, remained elevated. This indicates a continued cautious approach toward potential credit risks. Overall, this quarter indicates a period of strong financial results although the bank is maintaining caution to guard against possible economic uncertainties or loan defaults.

1 Data Collection and Pre-Processing of Transcripts

First, we downloaded all the earnings call transcripts from 2021Q2 to 2024Q3 from JPMorgan's website and saved them as a zip file in the raw data folder.

1.1 Extract information from transcript PDFs, put it into a table, and output as a csv

The following code performs batch-processing of PDF files without specifying links to individual documents.

1.1.0 Helper functions

In [ ]:
def get_valid_date(text_string):
    """
    helper function to check if a string is a valid date

    PARAMS:
        text_string (str) : some text

    RETURNS:
        (bool) : whether the text is a valid date
        parsed_date (datetime.datetime) : date if detected, None otherwise
    """
    try:
        parsed_date = parser.parse(text_string)
        return True, parsed_date
    except (ValueError, OverflowError):
        return False, None

def detect_date_in_list_of_strings(text_lines):
    """
    detects a valid date given a list of strings

    raises an error if a valid date is not detected

    PARAMS:
        text_lines (list) : list of strings

    RETURNS:
        date (datetime.datetime) : the detected date if present
    """
    valid_date, date = (False, None)
    # expect the date to be on the last line
    if text_lines:
        valid_date, date = get_valid_date(text_lines[-1])
    if not valid_date:
        # fall back to scanning the remaining lines
        for line in text_lines[:-1]:
            valid_date, date = get_valid_date(line)
            if valid_date:
                break
    if not valid_date:
        raise ValueError("Could not find a valid date!")

    return date

def detect_pattern(text_lines, pattern, unique=True):
    """
    detects the only (unique=True) or all occurrences of a string pattern
    given a list of strings

    raises an error if the pattern is not detected

    PARAMS:
        text_lines (list) : list of strings
        pattern (str) : a string to be detected in the provided strings
        unique (bool) : whether the pattern must be present in exactly one
                        string in text_lines; defaults to True

    RETURNS:
        detected_text (str or list) : Text containing the pattern (if present)
          If unique is True, there must be only one occurrence - returns a string
          If unique is False, returns a list of all occurrences
    """
    # expect the pattern to be in the first item
    if pattern in text_lines[0]:
        return text_lines[0]

    detected_texts = [text for text in text_lines if pattern in text]

    if unique:
        if len(detected_texts)>1:
            raise ValueError("More than one occurrence of the pattern detected!")
        elif not detected_texts:
            raise ValueError("Pattern not detected!")
        return detected_texts[0]  # return the first (and only) detected text
    else:
        return detected_texts  # return all occurrences (even if empty list)

def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

def is_valid_word(word):
    """
    checks if a word is title case, fully uppercase (acronym) or mixed-case

    PARAMS:
      word (str)

    RETURNS:
      bool
    """
    return (
        word.istitle() or
        word.isupper() or
        re.match(r'^[A-Z][a-zA-Z0-9]*$', word)
    )

def get_all_caps_lines(text_lines):
  """
  given a list of strings, returns the strings where all words are title case,
  upper case (e.g. acronyms), or mixed-case. Small words, like "a", "or", "for",
  are ignored, so "Bank of America" still gets returned

  PARAMS:
    text_lines (list of strs) : list that contains any sort of strings

  RETURNS:
    all_caps_lines (list of strs) : list of strings where all words (separated by spaces)
                                    are title-, upper-, or mixed-case
  """
  all_caps_lines = []
  for line in text_lines:
      # remove punctuation
      line_clean = remove_punctuation(line)

      # remove small words that are typically not capitalised
      small_words = [
        'a', 'an', 'the', 'and', 'but', 'or', 'for', 'nor', 'so',
        'yet', 'at', 'by', 'from', 'in', 'of', 'on', 'to', 'with'
      ]
      words = line_clean.split()
      filtered_words = [word for word in words if word.lower() not in small_words]
      line_clean = ' '.join(filtered_words)

      # check if all words are capitalised (as in names or titles)
      if all(is_valid_word(word) for word in line_clean.split()):
          all_caps_lines.append(line)

  return all_caps_lines

def remove_preceding_numbers(text):
    """
    replaces leading numbers in a string with an empty string
    """
    return re.sub(r'^\d+\s*', '', text)
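
As a quick sanity check, the date helper can be exercised on the kinds of lines that appear on a transcript's first page. The helper is re-stated here so the snippet is self-contained, and the example strings are illustrative rather than taken from an actual transcript:

```python
from dateutil import parser

def get_valid_date(text_string):
    """returns (True, parsed date) if text_string parses as a date, else (False, None)"""
    try:
        return True, parser.parse(text_string)
    except (ValueError, OverflowError):
        return False, None

ok, d = get_valid_date("11 October 2024")
print(ok, d.date())   # True 2024-10-11

ok, d = get_valid_date("MANAGEMENT DISCUSSION SECTION")
print(ok, d)          # False None
```

Note that `dateutil`'s `ParserError` subclasses `ValueError`, which is why catching `ValueError` is sufficient for non-date lines.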

1.1.1 Unzip files and define the path to the PDF file folder

In [ ]:
# Path to the zipped file
zip_file_path = os.path.join(raw_data_folder, "JPMorgan.zip")

# Path to the sub-directory for extracted files
jpm_folder = os.path.join(raw_data_folder, "JPMorgan")

# Create the target folder if it doesn't exist
os.makedirs(jpm_folder, exist_ok=True)

# Unzip the file
if os.path.exists(zip_file_path):
    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        zip_ref.extractall(jpm_folder)
    print(f"Extracted PDF files from {zip_file_path} to {jpm_folder}")
else:
    print(f"{zip_file_path} not found. Please check the file location.")

# Define the paths to the extracted PDF files
paths = [os.path.join(jpm_folder,f) for f in os.listdir(jpm_folder) if f.endswith(".pdf")]

1.1.2 Defining a JPMorgan PDF-to-table converter

The following function is tailored to the format of JPMorgan transcripts by assuming that:

  1. The first page contains the date on its own line
  2. The first page contains the quarter info formatted as "xQYY"
  3. Speakers are separated by "................."
  4. Everything said by the operator is preceded by "Operator: "
  5. There is a "MANAGEMENT DISCUSSION SECTION" and a "QUESTION AND ANSWER SECTION" titled in all-caps
  6. All words in speaker and company names are capitalised. The only exception is small words like "of", "at", etc.
  7. All Qs and As start with speaker information formatted like [name]\n[title, firm]\n[Q/A]

If the above assumptions are met, this function will return a dataframe with the following information:

  • uid (str) : informative unique identifier for each row
  • bank (str) : bank name (for process_jpmorgan_pdf, it is always JPMorgan)
  • year (int) : year discussed in the earnings call (for 4Q22, this is 2022, even though the earnings call took place in January 2023)
  • quarter (int) : quarter discussed in the earnings call
  • date (datetime) : date the earnings call took place
  • section (str) : section the text comes from ("management_discussion" or "questions_answers")
  • name (str) : name of the speaker
  • title (str) : job title of the speaker
  • firm (str) : firm the speaker represents
  • qa_type (str) : text type ("Q", "A", "N" for questions, answers, and neither respectively)
  • qa_num_within (int) : number of the question and answer within an earnings call; this can be used to map answers to questions within a call after processing
  • qa_num (int) : number of the question and answer across all earnings calls; this can be used to map answers to questions across calls after processing
  • qa_text (str) : question or answer text
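
The uid scheme can be illustrated with hypothetical values matching the first question of the 3Q24 call: the preliminary uid concatenates bank, quarter, Q/A type, and question number, and a suffix counts how many rows already share that preliminary uid (0 for the first occurrence).

```python
# illustration of the uid scheme: bank_quarter_type_number.occurrence
bank, quarter, qa_type, q_num_within = "JPMorganChase", "3Q24", "Q", 1
occurrence = 0  # first row carrying this preliminary uid

prelim_uid = f"{bank}_{quarter}_{qa_type}_{q_num_within}"
uid = prelim_uid + f".{occurrence}"
print(uid)  # JPMorganChase_3Q24_Q_1.0
```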
In [ ]:
def process_jpmorgan_pdf(path, q_num_init=0):
    doc = fitz.open(path)

    print(f"This document consists of {len(doc)} pages")
    bank = "JPMorganChase"

    df = pd.DataFrame(columns =["uid", "uid_prelim", "bank", "year", "quarter", "date", "section",
                            "name", "title", "firm", "qa_type", "qa_num_within", "qa_num", "qa_text"])
    last_speaker = ""
    current_section = ""
    q_num_within = 0
    q_num = q_num_init         # count questions to be able to link As to Qs in case of separate processing
    # iterate over pages
    for page_num in range(len(doc)):
        print(f"Working on page {page_num}...")
        page = doc.load_page(page_num)                          # load a page
        text = page.get_text()                                  # extract text
        text_stripped = text.strip()
        text_lines = text_stripped.split('\n')                  # split text into a list of strings

        if page_num == 0:                                       # first page
              date = detect_date_in_list_of_strings(text_lines)   # when the report was published
              year_published = str(date.year)

              try:
                  expected_pattern = f"Q{year_published[-2:]}"
                  quarter = detect_pattern(text_lines, expected_pattern, unique=True)
                  year = int(year_published)                             # year the report refers to
              except ValueError:   # Q4 calls are typically released in January, so the transcript refers to the previous year
                  expected_pattern = f"Q{int(year_published[-2:])-1}"
                  quarter = detect_pattern(text_lines, expected_pattern, unique=True)
                  year = int(year_published)-1
              if len(quarter.split(" ")) > 1:
                  # if quarter contains several "words", keep only the word with the pattern
                  quarter = detect_pattern(quarter.split(" "), expected_pattern, unique=True)
              continue

        # process the other pages
        text_sections = text_stripped.split("........................................................")
        text_sections = [item for item in text_sections if item.strip()]  # filter out empty strings

        # iterate over sections
        for section in text_sections:
            if "Operator: " in section:
                last_speaker = "Operator"
                if "QUESTION AND ANSWER" in section:
                    current_section = "questions_answers"
                continue

            if "MANAGEMENT DISCUSSION" in section:
                current_section = "management_discussion"
                continue
            elif "QUESTION AND ANSWER" in section:
                current_section = "questions_answers"
                continue

            # remove leading dots and empty lines
            section_nodots = re.sub(r'^\.+', '', section)
            lines = [item for item in section_nodots.split("\n") if item.strip()]
            if not lines:             # empty list
                continue

            # get lines where all words are capitalised (names, acronyms, "Q " or "A ")
            all_caps_lines = get_all_caps_lines(lines)

            # define desirable conditions
            bool_length = len(all_caps_lines)>=2

            # if not all_caps_lines and last_speaker != "Operator":
            if not bool_length and last_speaker != "Operator":
                # a section starts without anyone being introduced and it is not a
                # continuation of Operator's speech means that previous Q/A is continued
                # -- this will need to be merged with the preceding row afterwards
                name = title = firm = qa_type = ""
                qa_text = remove_preceding_numbers(section_nodots).replace('\n', '')

                df.at[df.index[-1], 'qa_text'] = df.iloc[-1].qa_text + " " + qa_text

            elif bool_length:
                name = all_caps_lines[0].strip()
                # print(all_caps_lines)
                title = all_caps_lines[1].split(",")[0].strip()
                firm = all_caps_lines[1].split(",")[1].strip()

                # catch cases where three all-caps phrases get picked
                # due to lines with just one capitalised word
                wrong_catch = False
                if len(all_caps_lines)>2:
                  if all_caps_lines[2].strip() not in ['Q', 'A']:
                    wrong_catch = True

                if len(all_caps_lines)==2 or wrong_catch:
                    qa_type = "N"
                    qa_id_within = np.nan
                    qa_id = np.nan
                    qa_text = section_nodots.split(firm)[-1].replace('\n', '').strip()
                else:
                    qa_type = all_caps_lines[2].strip()
                    if qa_type == "Q" and name != last_speaker:
                      q_num_within += 1
                      q_num += 1
                    qa_id_within = q_num_within
                    qa_id = q_num

                    # # if an executive answers before the analyst asks a question,
                    # # we do not want to count it as an answer to the earlier question
                    # if qa_type == "A" and last_speaker=="Operator":
                    #   qa_id_within = np.nan
                    #   qa_id = np.nan
                    # # commenting this out because sometimes executives continue
                    # # providing useful information even after the operator intervenes

                    str_splitter = section_nodots.split(qa_type)[0]
                    qa_text = section_nodots.split(str_splitter)[-1][2:].replace('\n', '').strip()
                    if qa_text=="": # try to catch PyMuPDF formatting errors!
                          if current_section=='questions_answers' and lines[0].strip()==name and not lines[2].strip()==qa_type:
                                qa_text = " ".join([item for item in lines if item not in all_caps_lines])

                # define an informative unique identifier
                prelim_uid = f"{bank}_{quarter}_{qa_type}_{q_num_within}"
                num_prelim_uids = (df.uid_prelim == prelim_uid).sum()
                uid = prelim_uid + f".{num_prelim_uids}"

                # check if it is not the same speaker continuing on a new page
                if name == last_speaker:
                    df.at[df.index[-1], 'qa_text'] = df.iloc[-1].qa_text + " " + qa_text
                # elif qa_id_within==0: # can happen if Q&A is started off by an executive
                #     pass
                else:
                    df_to_append = pd.DataFrame({
                    "uid": [uid],
                    "uid_prelim": [prelim_uid],
                    "bank": [bank],
                    "year": [year],
                    "quarter": [int(quarter[0])],
                    "date": [date],
                    "section": [current_section],
                    "name": [name],
                    "title": [title],
                    "firm": [firm],
                    "qa_type": [qa_type],
                    "qa_num_within" : [qa_id_within],
                    "qa_num": [qa_id],
                    "qa_text": [qa_text]
                    })

                    if not df_to_append.isna().all(axis=1).any():
                        df = pd.concat((df, df_to_append), ignore_index=True)

                last_speaker = name

            else:
                print("Excluding:\n", section_nodots)
                continue
                # raise ValueError("Failed!")

    df = df.drop(["uid_prelim"], axis=1)
    return df

1.1.3 Defining a wrapper function to batch-process all PDF files in the folder

Since each bank will need a dedicated transcript processing function, this wrapper function asks the user to input the bank they wish to process.

All files in the supplied pathlist must belong to the same bank. Otherwise, the wrapper function returns an error.

All information from the target bank is concatenated into a single csv file called "transcripts_tabular_{bank}.csv".

In [ ]:
def transcript_pdf_to_csv(pathlist, save_folder):
  banks = ['JPMorgan']
  print("Which bank are you seeking to process?")
  bank = input(banks)
  print("Do all paths in `pathlist` lead to files from this bank?")
  confirmation = input(['yes', 'no'])
  if confirmation != 'yes':
      raise ValueError("All paths must lead to the files from the chosen bank!")

  df_all_transcripts = pd.DataFrame(columns =["uid", "bank", "year", "quarter", "date", "section",
                            "name", "title", "firm", "qa_type", "qa_num_within", "qa_num", "qa_text"])
  for path in pathlist:
    q_num_init = 0 if pd.isna(df_all_transcripts.qa_num.max()) else df_all_transcripts.qa_num.max()
    if bank == 'JPMorgan':
      df = process_jpmorgan_pdf(path, q_num_init)
      df_all_transcripts = pd.concat((df_all_transcripts, df), ignore_index=True)

  df_all_transcripts.to_csv(os.path.join(save_folder, f"transcripts_tabular_{bank}.csv"), index=False)

1.1.4 Converting all PDF files in the folder to a single csv table

Running the function prints all information that gets excluded. In the case of JPMorgan, it excludes the disclaimers on the last page of each transcript.

In [ ]:
transcript_pdf_to_csv(pathlist=paths, save_folder=processed_data_folder)

1.1.5 Output

In [ ]:
df_jpm = pd.read_csv(os.path.join(processed_data_folder, "transcripts_tabular_JPMorgan.csv"))
df_jpm.head()
Out[ ]:
uid bank year quarter date section name title firm qa_type qa_num_within qa_num qa_text
0 JPMorganChase_3Q24_N_0.0 JPMorganChase 2024 3 2024-10-11 management_discussion Jeremy Barnum Chief Financial Officer JPMorgan Chase & Co. N NaN NaN Thank you and good morning, everyone. Starting...
1 JPMorganChase_3Q24_Q_1.0 JPMorganChase 2024 3 2024-10-11 questions_answers Jim Mitchell Analyst Seaport Global Securities LLC Q 1.0 1.0 Hey, good morning. So, Jeremy, as you highligh...
2 JPMorganChase_3Q24_A_1.0 JPMorganChase 2024 3 2024-10-11 questions_answers Jeremy Barnum Chief Financial Officer JPMorgan Chase & Co. A 1.0 1.0 Yeah. Sure, Jim. I'll try to answer both quest...
3 JPMorganChase_3Q24_Q_2.0 JPMorganChase 2024 3 2024-10-11 questions_answers Jim Mitchell Analyst Seaport Global Securities LLC Q 2.0 2.0 All right. Thanks a lot.
4 JPMorganChase_3Q24_A_2.0 JPMorganChase 2024 3 2024-10-11 questions_answers Jeremy Barnum Chief Financial Officer JPMorgan Chase & Co. A 2.0 2.0 Thanks.

1.1.6 Checking the data

In [ ]:
def check_dataset(df):
  print('########## Confirm that all text rows contain text ##########')
  print(f"There are {df.qa_text.isna().sum()} empty text rows")
  print('\n')

  print('########## Check whether categorical columns have the expected value range ##########')
  categoricals = ['bank', 'section', 'name', 'title', 'firm', 'qa_type']
  for cat in categoricals:
    print(f"{cat}: {df[cat].unique()}")
  print('\n')

  print('########## Check that numerical columns have the expected value range ##########')
  display(df.describe())
  print('\n')

  print('########## Confirm that all uids are unique ##########')
  print(f"There are {df.uid.unique().shape[0]} unique uids and {df.shape[0]} rows in the dataset")
  print('\n')

check_dataset(df_jpm)
########## Confirm that all text rows contain text ##########
There are 0 empty text rows


########## Check whether categorical columns have the expected value range ##########
bank: ['JPMorganChase']
section: ['management_discussion' 'questions_answers']
name: ['Jeremy Barnum' 'Jim Mitchell' 'Steven Chubak' 'Jamie Dimon'
 'Erika Najarian' 'Glenn Schorr' 'Gerard Cassidy' "Matt O'Connor"
 'Mike Mayo' 'Ebrahim H. Poonawala' 'Betsy L. Graseck' 'Saul Martinez'
 'Ken Usdin' 'John McDonald' 'Matt O’Connor' 'Charles W. Peabody'
 'Manan Gosalia' 'Ryan Kenny' 'John E. McDonald' 'Andrew Lim'
 'James Mitchell' 'Ebrahim Poonawala' 'Charles Peabody' 'Kenneth M. Usdin']
title: ['Chief Financial Officer' 'Analyst' 'Chairman & Chief Executive Officer'
 'Chief Financial Officer & Member-Operating Committee']
firm: ['JPMorgan Chase & Co.' 'Seaport Global Securities LLC'
 'Wolfe Research LLC' 'UBS Securities LLC' 'Evercore ISI'
 'RBC Capital Markets LLC' 'Deutsche Bank Securities'
 'Wells Fargo Securities LLC' 'Bank of America Merrill Lynch'
 'Morgan Stanley & Co. LLC' 'HSBC Securities (USA)' 'Jefferies LLC'
 'Autonomous Research' 'Portales Partners LLC' 'Société Générale'
 'Seaport Global Securities' 'Portales Partners' 'Jefferies & Company'
 'BofA Securities' 'UBS' 'Seaport Research Partners'
 'Société Générale SA (UK)']
qa_type: ['N' 'Q' 'A']


########## Check that numerical columns have the expected value range ##########
year quarter qa_num_within qa_num
count 1273.000000 1273.000000 1242.000000 1242.000000
mean 2022.359780 2.456402 21.553140 292.210950
std 1.042049 1.043786 12.453459 165.198839
min 2021.000000 1.000000 0.000000 1.000000
25% 2021.000000 2.000000 11.000000 152.000000
50% 2022.000000 2.000000 21.000000 298.500000
75% 2023.000000 3.000000 31.000000 437.750000
max 2024.000000 4.000000 49.000000 564.000000

########## Confirm that all uids are unique ##########
There are 1273 unique uids and 1273 rows in the dataset


1.2 Data Cleaning

In the transcripts, there are situations where interruptions occurred; these are reflected as '...' in the text. Furthermore, at the end of some analyst questions there are brief exchanges of greetings. These do not provide any information and should be removed.
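
As a minimal sketch of the flagging logic used in the cleaning steps that follow, both conditions can be computed in vectorised form on a few toy rows (the text below is made up for illustration; the interactive review itself is implemented in the subsequent cells):

```python
import pandas as pd

# toy rows: an interrupted question, a short greeting, and a normal answer
df = pd.DataFrame({"qa_text": [
    "Thanks, Jeremy. On capital, could you...",
    "Hey, good morning.",
    "Yeah. Sure, Jim. I'll try to answer both questions.",
]})

# interruption flag: text containing '...' or the ellipsis character
df["interruption"] = df["qa_text"].str.contains(r"\.\.\.|…", regex=True)

# short-text flag: at most five words
df["short"] = df["qa_text"].str.split().str.len() <= 5

print(df[["interruption", "short"]].values.tolist())
# [[True, False], [False, True], [False, False]]
```

Note that `str.split()` splits on any whitespace, a slightly more robust variant of the single-space `split(' ')` used in `clean_short` below.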

In [ ]:
# load the csv from the previous step
data = pd.read_csv(os.path.join(processed_data_folder, "transcripts_tabular_JPMorgan.csv"))

1.2.0 Helper functions / Load data

In [ ]:
def print_row(row, uids):
  if row['uid'] in uids:
    print(f"\033[31m{row['qa_type']} ({row['name']}): {row['qa_text']}\033[0m")
  else:
    print(f"{row['qa_type']} ({row['name']}): {row['qa_text']}")


def print_multi_question(data, qa_num, uids):
  data_qa = data[data['qa_num'] == qa_num]
  data_qa.apply(lambda x: print_row(x, uids), axis=1)

def print_single_question(data, qa_num, uid):
  data_qa = data[data['qa_num'].isin([qa_num, qa_num-1, qa_num+1])]
  data_qa.apply(lambda x: print_row(x, [uid]), axis=1)


def get_action_for_multi(df, num, uids):
  print_multi_question(df, num, uids)
  action = input("Do you want to consider them together?\n")
  return action

def get_action_for_single(uid):
  pass


def split_string_by_punctuation(text):
  # split into sentences, treating common abbreviations (i.e., J.P., U.S., ex.)
  # and short title-case words followed by a dot as non-boundaries
  punct_regex = r"(?=\S)(?:i\.e\.|J\.P\.|U\.S\.|ex\.|[A-Z][a-z]{0,3}\.|[^.?!]|\.(?!\s+[A-Z]))*.?"
  return re.findall(punct_regex, text)


def remove_sentence_with_three_dots(text):
  sentences = split_string_by_punctuation(text)
  if text.startswith('...') or text.startswith('…'):
    sentences = sentences[1:]
  if text.endswith('...') or text.endswith('…'):
    sentences = sentences[:-1]
  return ' '.join(sentences)

1.2.1 Dealing with interruptions

Identify where interruptions occur in transcripts by searching for '…' or '...'.

Then go through each instance and ask whether to delete the entire row ('delete all'), delete the sentence containing the triple dots ('delete part'), or keep it.

If two instances are identified within the same Q/A, there is an extra option of merging the two rows and deleting everything in between ('merge'); this corresponds to the case where an unimportant interruption happened during someone's speech.

In [ ]:
def clean_interruptions(df):
  # identify interruptions by search for triple dots
  df['qa_text'] = df['qa_text'].astype(str)
  df['interruption'] = df.apply(lambda x: x['qa_text'].find('...')!=-1 or x['qa_text'].find('…')!=-1, axis=1)

  interruption_num = df[df['interruption'] == True]['qa_num'].unique()
  # create a dictionary of {qa_num: [list of uids]}
  interruption_dict = {num: df[(df['interruption'] == True) & (df['qa_num'] == num)]['uid'].tolist() for num in interruption_num}

  # create an interface to display interruptions and ask for actions
  out = Output(layout={'width': '50em'})
  display(out)

  with out:
    for num in interruption_num:
      uids = interruption_dict[num]
      out.clear_output()

      # deal with the special case of 2 interruptions within one Q/A
      if len(uids) == 2:
        print_multi_question(df, num, uids)
        action = input("Do you want to consider them together? \n")
        if action.lower() == "yes":
          action = input("What's your action? \nType 'merge' to combine the two rows and delete anything in between. \nType 'keep' to do nothing. \n")
          if action == 'merge':
            # append the text from the second row to the first row
            df.loc[df['uid'] == uids[0], 'qa_text'] += ' ' + df.loc[df['uid'] == uids[1], 'qa_text'].values[0]
            # delete the second row and everything in between
            start_index = df.index[df['uid'] == uids[0]][0]
            end_index = df.index[df['uid'] == uids[1]][0]
            index_to_delete = range(start_index+1, end_index+1)
            df = df.drop(index_to_delete).reset_index(drop=True)
        else:
          for uid in uids:
            out.clear_output()
            print_single_question(df, num, uid)
            action = input("What's your action? \nType 'delete all' to delete the entire row. \nType 'delete part' to remove the sentence containing the triple dots. \nType 'keep' to do nothing. \n")

            if action == 'delete all':
              df = df[df['uid'] != uid]
            elif action == 'delete part':
              df.loc[df['uid'] == uid, "qa_text"] = df.loc[df['uid'] == uid, "qa_text"].apply(remove_sentence_with_three_dots)
      else:
        for uid in uids:
          out.clear_output()
          print_single_question(df, num, uid)
          action = input("What's your action? \nType 'delete all' to delete the entire row. \nType 'delete part' to remove the sentence containing the triple dots. \nType 'keep' to do nothing. \n")

          if action == 'delete all':
            df = df[df['uid'] != uid]
          elif action == 'delete part':
            df.loc[df['uid'] == uid, "qa_text"] = df.loc[df['uid'] == uid, "qa_text"].apply(remove_sentence_with_three_dots)
  df.drop(columns=['interruption'], inplace=True)
  return df

1.2.2 Dealing with short textsΒΆ

Identify rows that are short (word count <= 5). A large proportion of these are greetings recorded as separate messages.

The program then goes through each instance and asks whether to delete or keep the row.

InΒ [Β ]:
def clean_short(df):
  # identify short texts that contain at most 5 words
  df['short'] = df['qa_text'].astype(str).apply(lambda text: len(text.split()) <= 5)

  short_num = df[df['short'] == True]['qa_num'].unique()
  # create a dictionary of {qa_num: [list of uids]}
  short_dict = {num: df[(df['short'] == True) & (df['qa_num'] == num)]['uid'].tolist() for num in short_num}

  out = Output(layout={'width': '50em'})
  display(out)

  with out:
    for num in short_num:
      uids = short_dict[num]
      out.clear_output()
      for uid in uids:
        out.clear_output()
        print_single_question(df, num, uid)
        action = input("Do you want to delete this line? \nType 'yes' or 'no'. \n")

        if action=='yes':
          df = df[df['uid'] != uid]

  df.drop(columns=['short'], inplace=True)
  return df

1.2.3 Dealing with hanging Q or AΒΆ

Check for any hanging Q or A (a question or answer without its counterpart).

The program then goes through each instance and asks whether to delete the row or merge it into an adjacent Q/A by reassigning its qa_num. A lone answer is merged into the previous Q/A as an additional answer; a lone question is merged into the next Q/A.

InΒ [Β ]:
def clean_hanging(df):
  df.reset_index(drop=True, inplace=True)

  # getting all unique qa_num from the dataframe, then remove nan
  qa_num = pd.unique(df['qa_num'])
  qa_num = qa_num[~pd.isnull(qa_num)]

  # loop through each question number, if there are less than 2 rows with such question number,
  # then we have a hanging Q/A
  hanging_num = []
  for num in qa_num:
    if df[df['qa_num'] == num].shape[0] < 2:
      hanging_num.append(num)
  hanging_dict = {num: df[df['qa_num'] == num]['uid'].tolist() for num in hanging_num}

  out = Output(layout={'width': '50em'})
  display(out)

  with out:
    for num in hanging_num:
      for uid in hanging_dict[num]:
        out.clear_output()
        print_single_question(df, num, uid)
        action = input("Do you want to merge or delete? \nType 'merge' to merge to the previous/next A/Q. \nType 'delete' to remove the row.\n")

        if action == 'merge':
          if df[df['uid'] == uid]['qa_type'].values[0] == 'Q':
            # merge into the next Q/A group by taking its qa_num
            index = df[df['uid'] == uid].index[0]
            next_qa_num = df.loc[index+1, 'qa_num']
            next_qa_num_within = df.loc[index+1, 'qa_num_within']
            df.loc[index, 'qa_num'] = next_qa_num
            df.loc[index, 'qa_num_within'] = next_qa_num_within
          elif df[df['uid'] == uid]['qa_type'].values[0] == 'A':
            # merge into the previous Q/A group by taking its qa_num
            index = df[df['uid'] == uid].index[0]
            prev_qa_num = df.loc[index-1, 'qa_num']
            prev_qa_num_within = df.loc[index-1, 'qa_num_within']
            df.loc[index, 'qa_num'] = prev_qa_num
            df.loc[index, 'qa_num_within'] = prev_qa_num_within
        elif action == 'delete':
          df = df[df['uid'] != uid]

  return df

1.2.4 Update the uid to reflect changes in question numberΒΆ

InΒ [Β ]:
def update_uid(df):
  # renumber within each qa_num and rebuild the uid string (note: use df, not the global `data`)
  df.loc[~df['qa_num'].isna(), "uid"] = (
      df[~df['qa_num'].isna()]
      .assign(uid=lambda x: x.groupby('qa_num').cumcount() + 1)
      .assign(uid=lambda x: [f"{bank}_{int(quarter)}Q{int(year-2000)}_{qa_type}_{uid}.0" for bank, quarter, year, qa_type, uid in zip(x['bank'], x['quarter'], x['year'], x['qa_type'], x['uid'])])
  )
  return df
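The uid string rebuilt above follows the pattern `{bank}_{quarter}Q{yy}_{qa_type}_{n}.0`; a minimal sketch of the format, using values from a row shown later in the notebook:

```python
# Illustration of the uid format produced by update_uid.
bank, quarter, year, qa_type, uid = 'JPMorganChase', 3, 2024, 'Q', 28
new_uid = f"{bank}_{int(quarter)}Q{int(year - 2000)}_{qa_type}_{uid}.0"
print(new_uid)  # JPMorganChase_3Q24_Q_28.0
```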

1.2.5 Running the cleaning process and save the cleaned dataΒΆ

InΒ [Β ]:
data = clean_interruptions(data)
InΒ [Β ]:
data = clean_short(data)
InΒ [Β ]:
data = clean_hanging(data)
InΒ [Β ]:
data = update_uid(data)
# save the data
data.to_csv(processed_data_folder + "/transcripts_tabular_JPMorgan_clean.csv", index=False)

1.2.6 Test on subset of dataset (if needed)ΒΆ

InΒ [Β ]:
# load the uncleaned dataset, select the first 146 rows (corresponding to the latest 2 quarters)
test_data = pd.read_csv(processed_data_folder+"/transcripts_tabular_JPMorgan.csv")
test_data = test_data.head(146)
# manually add a row to showcase merging in same question
new_row = {'uid':'Extra', 'qa_type':'A', 'qa_num_within':4, 'qa_num':4, 'qa_text':'Random Insert' }
insert_index = 10
test_data = pd.concat([test_data.iloc[:insert_index], pd.DataFrame([new_row]), test_data.iloc[insert_index:]]).reset_index(drop=True)
InΒ [Β ]:
test_data = clean_interruptions(test_data)
InΒ [Β ]:
test_data = clean_short(test_data)
InΒ [Β ]:
test_data = clean_hanging(test_data)
InΒ [Β ]:
test_data = update_uid(test_data)
InΒ [Β ]:
test_data

1.3 Initial Exploratory Data AnalysisΒΆ

1.3.1 Transcripts overviewΒΆ

InΒ [Β ]:
# Defining a function to preprocess text
def preprocess_text(text):

    # Checking if the input is a string
    if not isinstance(text, str):
        return []

    # Converting the text to lowercase
    text = text.lower()

    # Removing punctuation from the text (including special quotation marks and apostrophes)
    text = re.sub(r"[^\w\s]", '', text)

    # Removing numbers from the text
    text = re.sub(r'\d+', '', text)

    # Tokenizing the text
    tokens = word_tokenize(text)

    # Removing stopwords from the text
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatizing the words
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return tokens
InΒ [Β ]:
# load the cleaned dataset
transcripts_df = pd.read_csv(processed_data_folder + "/transcripts_tabular_JPMorgan_clean.csv")

# Preprocessing the 'qa_text' column in the transcripts data
transcripts_df['qa_text_processed'] = transcripts_df['qa_text'].apply(preprocess_text)

# Viewing the preprocessed data
transcripts_df[['qa_text', 'qa_text_processed']].head()
Out[Β ]:
qa_text qa_text_processed
0 Thank you and good morning, everyone. Starting... [thank, good, morning, everyone, starting, pag...
1 Hey, good morning. So, Jeremy, as you highligh... [hey, good, morning, jeremy, highlighted, full...
2 Yeah. Sure, Jim. I'll try to answer both quest... [yeah, sure, jim, ill, try, answer, question, ...
3 Hi. Good morning. So Jeremy, how are you? So I... [hi, good, morning, jeremy, want, ask, expense...
4 Sure. So good question and I agree with your n... [sure, good, question, agree, number, agree, w...
InΒ [Β ]:
# Pulling all words from the 'qa_text_processed' column into a list
transcripts_all_words = [word for tokens in transcripts_df['qa_text_processed'] for word in tokens]

# Calculating the frequency distribution of the words
transcripts_freq_dist = FreqDist(transcripts_all_words)
InΒ [Β ]:
# Getting the top 10 words and their frequencies
transcripts_top_10 = transcripts_freq_dist.most_common(10)
transcripts_words, transcripts_counts = zip(*transcripts_top_10)

# Plotting the top 10 words in a barplot
plt.figure(figsize=(12, 6))
sns.barplot(x=list(transcripts_words), y=list(transcripts_counts), palette="viridis")
plt.title('Top 10 Words in the Transcripts')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()


InΒ [Β ]:
# Combining the tokens from the 'qa_text_processed' column
transcripts_text = ' '.join([' '.join(tokens) for tokens in transcripts_df['qa_text_processed']])

# Generating the word cloud for the text
transcripts_wordcloud = WordCloud(width=800, height=400, background_color='white').generate(transcripts_text)

# Plotting the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(transcripts_wordcloud, interpolation='bilinear')
plt.title('Word Cloud for the Transcripts')
plt.axis('off')
plt.show()

Many of the most frequent words are not hugely informative. As such, I will preprocess the text again, this time removing an additional set of custom stopwords (filler words and speaker names).

InΒ [Β ]:
# Defining a function to preprocess text
def preprocess_text(text):
    # Checking if the input is a string
    if not isinstance(text, str):
        return []

    # Converting the text to lowercase
    text = text.lower()

    # Removing punctuation from the text (including special quotation marks and apostrophes)
    text = re.sub(r"[^\w\s]", '', text)

    # Removing numbers from the text
    text = re.sub(r'\d+', '', text)

    # Tokenizing the text
    tokens = word_tokenize(text)

    # Defining stopwords and adding custom words
    stop_words = set(stopwords.words('english'))
    custom_stop_words = {"think", "going", "thats", "like", "bit", "thing", "yeah",
                         "see", "would", "youre", "question", "could", "dont",
                         "stuff", "jeremy", "lot", "betsy", "teresa", "michael",
                         "hi", "hey", "hello", "maybe", "jamie", "go", "weve"}
    stop_words.update(custom_stop_words)

    # Removing stopwords from the text
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatizing the words
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return tokens
InΒ [Β ]:
# Preprocessing the 'qa_text' column in the transcripts data
transcripts_df['qa_text_processed'] = transcripts_df['qa_text'].apply(preprocess_text)

# Viewing the preprocessed data
transcripts_df[['qa_text', 'qa_text_processed']].head()
Out[Β ]:
qa_text qa_text_processed
0 Thank you and good morning, everyone. Starting... [thank, good, morning, everyone, starting, pag...
1 Hey, good morning. So, Jeremy, as you highligh... [good, morning, highlighted, full, year, nii, ...
2 Yeah. Sure, Jim. I'll try to answer both quest... [sure, jim, ill, try, answer, question, togeth...
3 Hi. Good morning. So Jeremy, how are you? So I... [good, morning, want, ask, expense, light, com...
4 Sure. So good question and I agree with your n... [sure, good, agree, number, agree, way, youve,...
InΒ [Β ]:
# Pulling all words from the 'qa_text_processed' column into a list
transcripts_all_words = [word for tokens in transcripts_df['qa_text_processed'] for word in tokens]

# Calculating the frequency distribution of the words
transcripts_freq_dist = FreqDist(transcripts_all_words)
InΒ [Β ]:
# Getting the top 10 words and their frequencies
transcripts_top_10 = transcripts_freq_dist.most_common(10)
transcripts_words, transcripts_counts = zip(*transcripts_top_10)

# Plotting the top 10 words in a barplot
plt.figure(figsize=(12, 6))
sns.barplot(x=list(transcripts_words), y=list(transcripts_counts), palette="viridis")
plt.title('Top 10 Words in the Transcripts')
plt.xlabel('Words')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()


InΒ [Β ]:
# Combining the tokens from the 'qa_text_processed' column
transcripts_text = ' '.join([' '.join(tokens) for tokens in transcripts_df['qa_text_processed']])

# Generating the word cloud for the text
transcripts_wordcloud = WordCloud(width=800, height=400, background_color='white').generate(transcripts_text)

# Plotting the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(transcripts_wordcloud, interpolation='bilinear')
plt.title('Word Cloud for the Transcripts')
plt.axis('off')
plt.show()

1.3.2 1Q22 and 2Q24ΒΆ

Now, let's identify two interesting quarters to investigate based on the financial metrics.

InΒ [Β ]:
# Getting the unique metric types
unique_metric_types = metrics_df['metric_type'].unique()

# Defining colors for each plot
colors = ['blue', 'green', 'red', 'purple']

# Plotting in a 2x2 grid
fig, axes = plt.subplots(2, 2, figsize=(18, 9))

# Flattening axes array for easy iteration
axes = axes.flatten()

# Looping through metric types and corresponding axes
for i, metric in enumerate(unique_metric_types):
    subset = metrics_df[metrics_df['metric_type'] == metric]
    axes[i].plot(subset['Q&FY'], subset['metric_value'], marker='o', color=colors[i])
    axes[i].set_title(f'Metric: {metric}', fontsize=16, color=colors[i])
    axes[i].set_xlabel('Quarter', fontsize=14)
    axes[i].set_ylabel('Metric Value', fontsize=14)
    axes[i].tick_params(axis='x', labelsize=12, rotation=45)
    axes[i].tick_params(axis='y', labelsize=12)
    axes[i].grid(visible=True, linestyle='--', alpha=0.5)

# Adjusting the layout
plt.tight_layout()

# Showing the plots
plt.show()

Picking out two interesting dates to explore based on the charts of key financial metrics above:

1Q22

In 1Q22, the bank faced significant financial challenges, marked by a low CET1 capital ratio and declining net income and EPS. This suggests reduced profitability and shareholder returns. The increase in provisions for credit losses suggests heightened caution regarding potential loan defaults. Overall, this quarter indicates a period of financial strain and risk management.

2Q24

In 2Q24, the bank reached peak performance across several key metrics, with its CET1 capital ratio, net income, and EPS all at their highest levels. This suggests a strong capital position, robust profitability, and high returns to shareholders. Provisions for credit losses, however, remained elevated. This indicates a continued cautious approach toward potential credit risks. Overall, this quarter indicates a period of strong financial results although the bank is maintaining caution to guard against possible economic uncertainties or loan defaults.

InΒ [Β ]:
# Converting the date column to datetime
transcripts_df['date'] = pd.to_datetime(transcripts_df['date'])
InΒ [Β ]:
# Creating separate dataframes for the Apr-22 and Jul-24 Q&As
apr_22 = transcripts_df[(transcripts_df['date'] >= '2022-04-01') & (transcripts_df['date'] < '2022-05-01')]
jul_24 = transcripts_df[(transcripts_df['date'] >= '2024-07-01') & (transcripts_df['date'] < '2024-08-01')]

# Pulling all words from qa_text_processed column into a list
apr_22_all_words = [word for tokens in apr_22['qa_text_processed'] for word in tokens]
jul_24_all_words = [word for tokens in jul_24['qa_text_processed'] for word in tokens]

# Calculating the frequency distribution of the words from each dataset
apr_22_freq_dist = FreqDist(apr_22_all_words)
jul_24_freq_dist = FreqDist(jul_24_all_words)
InΒ [Β ]:
# Getting the top 10 words and their frequencies from the Apr-22 Q&As
apr_22_top_10 = apr_22_freq_dist.most_common(10)
apr_22_words, apr_22_counts = zip(*apr_22_top_10)

# Getting the top 10 words and their frequencies from the Jul-24 Q&As
jul_24_top_10 = jul_24_freq_dist.most_common(10)
jul_24_words, jul_24_counts = zip(*jul_24_top_10)
InΒ [Β ]:
# Creating a figure with 1 row and 2 columns of subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))

# Plotting the top 10 words in a barplot for the Apr-22 Q&As
sns.barplot(x=list(apr_22_words), y=list(apr_22_counts), palette="viridis", ax=ax1)
ax1.set_title('Top 10 Words in the April 2022 Q&As')
ax1.set_xlabel('Words')
ax1.set_ylabel('Frequency')
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=45)

# Plotting the top 10 words in a barplot for the Jul-24 Q&As
sns.barplot(x=list(jul_24_words), y=list(jul_24_counts), palette="viridis", ax=ax2)
ax2.set_title('Top 10 Words in the July 2024 Q&As')
ax2.set_xlabel('Words')
ax2.set_ylabel('Frequency')
ax2.set_xticklabels(ax2.get_xticklabels(), rotation=45)

# Adjusting the layout
plt.tight_layout()

# Showing the plots
plt.show()


InΒ [Β ]:
# Creating a dataframe excluding Apr-22 Q&As
not_apr_22 = transcripts_df[(transcripts_df['date'] < '2022-04-01') | (transcripts_df['date'] >= '2022-05-01')]

# Creating a dataframe excluding Jul-24 Q&As
not_jul_24 = transcripts_df[(transcripts_df['date'] < '2024-07-01') | (transcripts_df['date'] >= '2024-08-01')]
InΒ [Β ]:
# Getting unique words for the Apr-22 Q&As
apr_22_unique_words = set(apr_22_freq_dist.keys()) - set(not_apr_22['qa_text_processed'].explode())
apr_22_unique_counts = {word: apr_22_freq_dist[word] for word in apr_22_unique_words}
apr_22_unique_df = pd.DataFrame(apr_22_unique_counts.items(), columns=['word', 'apr_22_frequency'])

# Getting unique words for the Jul-24 Q&As
jul_24_unique_words = set(jul_24_freq_dist.keys()) - set(not_jul_24['qa_text_processed'].explode())
jul_24_unique_counts = {word: jul_24_freq_dist[word] for word in jul_24_unique_words}
jul_24_unique_df = pd.DataFrame(jul_24_unique_counts.items(), columns=['word', 'jul_24_frequency'])
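The set-difference logic above can be illustrated in isolation; a minimal sketch with invented tokens, using `collections.Counter` as a stand-in for NLTK's `FreqDist` (which behaves the same for this purpose):

```python
from collections import Counter

# Tokens unique to one period are those absent from all other periods;
# the invented token lists below are for illustration only.
period_freq = Counter(['nickel', 'capital', 'nickel', 'loan'])
other_tokens = ['capital', 'loan', 'growth']

unique_words = set(period_freq) - set(other_tokens)
unique_counts = {w: period_freq[w] for w in unique_words}
print(unique_counts)  # {'nickel': 2}
```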
InΒ [Β ]:
# Getting the top 10 unique words for Apr-22 and Jul-24
apr_22_top_words = apr_22_unique_df.nlargest(10, 'apr_22_frequency')
jul_24_top_words = jul_24_unique_df.nlargest(10, 'jul_24_frequency')
InΒ [Β ]:
# Setting up the plot
plt.figure(figsize=(14, 6))

# Creating a bar plot for the top 10 Apr-22 exclusive words
plt.subplot(1, 2, 1)
sns.barplot(data=apr_22_top_words, x='word', y='apr_22_frequency', palette='Blues')
plt.title('Top 10 Unique Words Used in the April 2022 Q&As')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.xticks(rotation=45)

# Creating a bar plot for the top 10 Jul-24 exclusive words
plt.subplot(1, 2, 2)
sns.barplot(data=jul_24_top_words, x='word', y='jul_24_frequency', palette='Oranges')
plt.title('Top 10 Unique Words Used in the July 2024 Q&As')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.xticks(rotation=45)

# Adjusting the layout
plt.tight_layout()

# Showing the plots
plt.show()


In 1Q22, the term β€œrussia-associated” is likely tied to the Russia-Ukraine conflict. Similarly, the word β€œnickel” may indicate exposure to commodity price volatility, possibly also linked to that conflict.

In 2Q24, words like β€œindex” and β€œgovernor” indicate attention to macroeconomic indicators and regulatory guidance.

1.3.3 Emerging risksΒΆ

To explore emerging risks, let's focus on the last two quarters.

InΒ [Β ]:
# Getting the unique metric types
unique_metric_types = metrics_df['metric_type'].unique()

# Defining colors for each plot
colors = ['blue', 'green', 'red', 'purple']

# Plotting in a 2x2 grid
fig, axes = plt.subplots(2, 2, figsize=(18, 9))

# Flattening axes array for easy iteration
axes = axes.flatten()

# Looping through metric types and corresponding axes
for i, metric in enumerate(unique_metric_types):
    subset = metrics_df[metrics_df['metric_type'] == metric]
    axes[i].plot(subset['Q&FY'], subset['metric_value'], marker='o', color=colors[i])
    axes[i].set_title(f'Metric: {metric}', fontsize=16, color=colors[i])
    axes[i].set_xlabel('Quarter', fontsize=14)
    axes[i].set_ylabel('Metric Value', fontsize=14)
    axes[i].tick_params(axis='x', labelsize=12, rotation=45)
    axes[i].tick_params(axis='y', labelsize=12)
    axes[i].grid(visible=True, linestyle='--', alpha=0.5)

# Adjusting the layout
plt.tight_layout()

# Showing the plots
plt.show()
InΒ [Β ]:
# Creating a dataframe for the 2Q24 and 3Q24 Q&As
Q23_FY24 = transcripts_df[(transcripts_df['date'] >= '2024-07-01') & (transcripts_df['date'] < '2024-11-01')]

# Pulling all words from qa_text_processed column into a list
Q23_FY24_all_words = [word for tokens in Q23_FY24['qa_text_processed'] for word in tokens]

# Calculating the frequency distribution of the words from the dataset
Q23_FY24_freq_dist = FreqDist(Q23_FY24_all_words)
InΒ [Β ]:
# Getting the top 10 words and their frequencies from the 2Q24 and 3Q24 Q&As
Q23_FY24_top_10 = Q23_FY24_freq_dist.most_common(10)
Q23_FY24_words, Q23_FY24_counts = zip(*Q23_FY24_top_10)
InΒ [Β ]:
# Setting up the plot
fig, ax1 = plt.subplots(figsize=(14, 6))

# Plotting the top 10 words in a barplot for the 2Q24 and 3Q24 Q&As
sns.barplot(x=list(Q23_FY24_words), y=list(Q23_FY24_counts), palette="viridis", ax=ax1)
ax1.set_title('Top 10 Words in the 2Q24 and 3Q24 Q&As')
ax1.set_xlabel('Words')
ax1.set_ylabel('Frequency')
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=45)

# Adjusting the layout
plt.tight_layout()

# Showing the plots
plt.show()


InΒ [Β ]:
# Creating a dataframe excluding 2Q24 and 3Q24 Q&As
not_Q23_FY24 = transcripts_df[(transcripts_df['date'] < '2024-07-01') | (transcripts_df['date'] >= '2024-11-01')]
InΒ [Β ]:
# Getting unique words for the 2Q24 and 3Q24 Q&As
Q23_FY24_unique_words = set(Q23_FY24_freq_dist.keys()) - set(not_Q23_FY24['qa_text_processed'].explode())
Q23_FY24_unique_counts = {word: Q23_FY24_freq_dist[word] for word in Q23_FY24_unique_words}
Q23_FY24_unique_df = pd.DataFrame(Q23_FY24_unique_counts.items(), columns=['word', 'Q23_FY24_frequency'])
InΒ [Β ]:
# Getting the top 10 unique words for the 2Q24 and 3Q24 Q&As
Q23_FY24_top_words = Q23_FY24_unique_df.nlargest(10, 'Q23_FY24_frequency')
InΒ [Β ]:
# Setting up the plot
fig, ax1 = plt.subplots(figsize=(14, 6))

# Creating a bar plot for the top 10 2Q24 and 3Q24 exclusive words
sns.barplot(data=Q23_FY24_top_words, x='word', y='Q23_FY24_frequency', palette='Blues')
plt.title('Top 10 Unique Words Used in the 2Q24 and 3Q24 Q&As')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.xticks(rotation=45)

# Adjusting the layout
plt.tight_layout()

# Showing the plots
plt.show()


Words like β€œspike” and β€œtrough” could reflect the recent fluctuations in net income and EPS.

InΒ [Β ]:
# Calculating word frequencies and total word count in 2Q24 and 3Q24 Q&As
Q23_FY24_freq = transcripts_df[(transcripts_df['date'] >= '2024-07-01') & (transcripts_df['date'] < '2024-11-01')]
Q23_FY24_word_counts = Q23_FY24_freq['qa_text_processed'].explode().value_counts()
Q23_FY24_total_words = Q23_FY24_word_counts.sum()  # Total word count for 2Q24 & 3Q24

# Calculating word frequencies and total word count in other quarters
not_Q23_FY24_word_counts = not_Q23_FY24['qa_text_processed'].explode().value_counts()
not_Q23_FY24_total_words = not_Q23_FY24_word_counts.sum()  # Total word count for other quarters
InΒ [Β ]:
# Creating a DataFrame comparing relative frequencies
word_comparison_df = pd.DataFrame({
    'Q23_FY24_proportion': Q23_FY24_word_counts / Q23_FY24_total_words,
    'other_quarters_proportion': not_Q23_FY24_word_counts / not_Q23_FY24_total_words
}).fillna(0)
InΒ [Β ]:
# Adding a column for proportion difference
word_comparison_df['proportion_difference'] = word_comparison_df['Q23_FY24_proportion'] - word_comparison_df['other_quarters_proportion']
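A minimal sketch of the proportional-difference comparison above, using invented word counts for a recent period versus all other quarters:

```python
import pandas as pd

# Invented counts: 'capital' is over-represented recently, 'growth' under-represented.
recent_counts = pd.Series({'capital': 30, 'nii': 20, 'growth': 10})
other_counts = pd.Series({'capital': 100, 'nii': 50, 'growth': 200})

comparison = pd.DataFrame({
    'recent_proportion': recent_counts / recent_counts.sum(),
    'other_proportion': other_counts / other_counts.sum(),
}).fillna(0)
comparison['proportion_difference'] = (
    comparison['recent_proportion'] - comparison['other_proportion']
)
# A positive difference means the word is relatively more frequent recently.
```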
InΒ [Β ]:
# Filtering for words that are relatively more frequent in 2Q24 and 3Q24
higher_in_Q23_FY24 = word_comparison_df[word_comparison_df['proportion_difference'] > 0]
InΒ [Β ]:
# Selecting the top 10 words with the highest proportion difference
top_higher_words = higher_in_Q23_FY24.nlargest(10, 'proportion_difference').reset_index().rename(columns={'qa_text_processed': 'word'})
InΒ [Β ]:
# Plotting the top words by proportional difference
fig, ax = plt.subplots(figsize=(14, 6))
sns.barplot(data=top_higher_words, x='word', y='proportion_difference', palette='Purples')
plt.title('Top 10 Words with Higher Proportion in 2Q24 and 3Q24 Compared to Other Quarters')
plt.xlabel('Words')
plt.ylabel('Proportional Difference')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


A higher usage of "capital" could relate to how the bank has maintained a high CET1 capital ratio, indicating a strong capital position.

A higher usage of "nii" (net interest income) could relate to the recent fluctuations in net income and EPS.

2 Selecting Models and Scalability AssessmentΒΆ

2.0 Creation of evaluation datasetΒΆ

To evaluate model performance and help choose the best model for further analysis, we create two datasets:

  • Synthetic Data: Text data with predefined sentiment, topic, and evasion labels was generated by GPT-4 and extensively refined manually to ensure accuracy. This dataset allowed us to test the models against controlled baselines and evaluate their precision in tasks with known outcomes. This file is created manually and uploaded in the clean data folder.

  • Ground Truth Data: Two transcripts’ worth of texts were randomly sampled and manually annotated for sentiment, topic, and question evasion status. This dataset provided a benchmark for assessing the models' real-world performance. The code selected random rows to form a ground truth dataset, saved as an .xlsx file. We then manually labelled them.

2.0.1 Selecting ground truth dataΒΆ

InΒ [Β ]:
def create_ground_truth(folder_path, output_folder_path):
    # define the bank of interest
    banks = ['JPMorgan']
    print("Which bank's data do you wish to label?")
    bank = input(f"Choose from {banks}: ")

    label_types = ['sentiment', 'topic', 'Q_outcome']

    # create a new ground truth file from the cleaned transcripts
    # (note: no leading '/' in the second argument, or os.path.join discards folder_path)
    print("Creating a new ground truth dataset...")
    transcript_csv_path = os.path.join(folder_path, f"transcripts_tabular_{bank}_clean.csv")
    transcripts_all = pd.read_csv(transcript_csv_path)

    transcripts_qa = transcripts_all[transcripts_all.section=='questions_answers'].reset_index()
    
    # consider only the qa_num where there are both Q and A (i.e. at least two rows per qa_num)
    qa_nums_all = transcripts_qa.qa_num.value_counts().loc[lambda x: x > 1].index.unique()

    median_num_Q = int(transcripts_all.qa_num_within.median())
    print(f"Median number of Q&As per transcript: {median_num_Q}")

    # how many transcripts should the Qs represent?
    NUM_TRANSCRIPTS = 2
    SEED = 42

    # sample unique Q&As
    random.seed(SEED)
    qa_num_sample = random.sample(list(qa_nums_all), int(NUM_TRANSCRIPTS*median_num_Q))
    ground_truth = transcripts_all[transcripts_all.qa_num.isin(qa_num_sample)]
    ground_truth = ground_truth[['uid', 'qa_type', 'qa_num', 'qa_text']]

    for col in label_types:
      ground_truth[f"true_{col}"] = np.nan
    ground_truth_path = os.path.join(output_folder_path, f"ground_truth_{bank}_manual.xlsx")
    ground_truth.to_excel(ground_truth_path, index=False)
InΒ [Β ]:
create_ground_truth(processed_data_folder, output_data_folder)

The file is saved as an Excel file in the output_data folder. We then manually labelled the topic, sentiment, and evasion status for each text in Google Sheets, with ChatGPT facilitating the decision making. This human-annotated ground truth file was then saved in the processed_data folder and used for all subsequent ground truth analyses.

2.1 Phi 3.5 for summarisationΒΆ

THIS SECTION REQUIRES UP TO 32 GB OF RAM TO RUN (A100 GPU)

2.1.0 Phi 3.5 initialisationΒΆ

InΒ [Β ]:
# Initialise the pipeline - note: a T4 GPU does not have enough memory, so either run on CPU (and ignore the warning) or use an A100 GPU
pipe = pipeline("text-generation", model="microsoft/Phi-3.5-mini-instruct", trust_remote_code=True, device=0)

2.1.1 Summarisation function using Phi 3.5ΒΆ

The prompt we are using is

Summarise the following text in a consistent, concise format. Limit to 1-2 sentences focusing only on main financial themes, metrics, and indicators relevant to financial performance. Avoid interpretations, bullet points, and variable formatting. Do not allow style drift in the answers. \n\nText:\n{text}\n\nSummary:

where {text} is the text from each question or answer.

InΒ [Β ]:
# Summarisation function
def phi_summarise(input_df, input_col, batch_size=8):
  """
  Function to Summarise text when given a pre-processed Q&A table as input
    - Focusses on financial themes
    - Summarises in 1-2 sentences
  """

  start_time = time.time()
  total_count = len(input_df)
  x = 0

  # Initialise a new column for summarised text
  input_df['summarised_text'] = ""

  # Process the DataFrame in batches
  for start in range(0, total_count, batch_size):
      end = min(start + batch_size, total_count)
      batch_texts = input_df[input_col][start:end].tolist()

      # Define prompt for each text in the batch
      prompt = [
          (
              f"Summarise the following text in a consistent, concise format. Limit to 1-2 sentences focusing only on main financial themes, "
              f"metrics, and indicators relevant to financial performance. Avoid interpretations, bullet points, and variable formatting. "
              f"Do not allow style drift in the answers. \n\nText:\n{text}\n\nSummary:"
          )
          for text in batch_texts
      ]

      print(f"Processing batch {start // batch_size + 1}/{(total_count + batch_size - 1) // batch_size}")

      # Run the model on the batch prompt
      batch_summaries = pipe(prompt, max_new_tokens=75, do_sample=False)

      # Extract summaries and clean up the text
      for i, summary in enumerate(batch_summaries):
          generated_text = summary[0]['generated_text'].replace(prompt[i], "").replace("Text:\n", "").replace("Summary:", "").strip()
          first_line = generated_text.split('\n', 1)[0].strip()
          cleaned_text = re.sub(r"#[-β€’]\s*", "", first_line).strip()

          # Update Input DataFrame with the cleaned summary
          input_df.at[start + i, 'summarised_text'] = cleaned_text
          x += 1

  # Calculate time for full dataset
  end_time = time.time()
  time_taken = end_time - start_time

  print(f"Time taken for {total_count} rows: {round(time_taken/60, 2)} minutes")
  print(f"Estimate for all transcripts (927): {round((time_taken/total_count * 927)/60/60, 2)} hours")

  return input_df

2.1.2 Generate summarised text for Q&A tablesΒΆ

InΒ [Β ]:
# Load ground truth files
ground_truth_df = pd.read_excel(processed_data_folder + "/ground_truth_JPMorgan_manual.xlsx")

# Load full Q&A table
qa_df = pd.read_csv(processed_data_folder + "/transcripts_tabular_JPMorgan_clean.csv")
InΒ [Β ]:
# Run summarisation function on Ground Truth Q&A dataset (not aggregated)
summarised_df_gt = phi_summarise(ground_truth_df, 'qa_text', 8)
display(summarised_df_gt.head())
uid qa_type qa_num qa_text true_sentiment true_topic true_Q_outcome
0 JPMorganChase_3Q24_Q_28.0 Q 28 And so Daniel's comments in September were on ... neutral Earnings NaN
1 JPMorganChase_3Q24_A_28.0 A 28 No. Those were core NII or NII ex.. So again, ... negative Earnings NE
2 JPMorganChase_2Q24_Q_26.0 Q 62 Very good. And as a follow-up, you've been ver... neutral Financials NaN
3 JPMorganChase_2Q24_A_26.0 A 62 Yeah. It's a good question. I think the short ... positive Financials NE
4 JPMorganChase_1Q24_Q_30.0 Q 97 Thank you. And I guess, as a tie-in to that qu... neutral M&A | Investments NaN
Processing batch 1/14
Processing batch 2/14
Processing batch 3/14
Processing batch 4/14
Processing batch 5/14
Processing batch 6/14
Processing batch 7/14
Processing batch 8/14
Processing batch 9/14
Processing batch 10/14
Processing batch 11/14
Processing batch 12/14
Processing batch 13/14
Processing batch 14/14
Time taken for 108 rows: 6.94 minutes
Estimate for all transcripts (927): 0.99 hours
uid qa_type qa_num qa_text true_sentiment true_topic true_Q_outcome Summarised text summarised_text
0 JPMorganChase_3Q24_Q_28.0 Q 28 And so Daniel's comments in September were on ... neutral Earnings NaN Daniel seeks clarification on the breakdown of...
1 JPMorganChase_3Q24_A_28.0 A 28 No. Those were core NII or NII ex.. So again, ... negative Earnings NE The current consensus for NII ex. is $87 billi...
2 JPMorganChase_2Q24_Q_26.0 Q 62 Very good. And as a follow-up, you've been ver... neutral Financials NaN The discussion highlights the strong performan...
3 JPMorganChase_2Q24_A_26.0 A 62 Yeah. It's a good question. I think the short ... positive Financials NE The C&I charge-off rate, historically very low...
4 JPMorganChase_1Q24_Q_30.0 Q 97 Thank you. And I guess, as a tie-in to that qu... neutral M&A | Investments NaN The speaker discusses the growth of private cr...
In [ ]:
# Save summarised ground truth file
summarised_df_gt.to_excel(processed_data_folder + "/phi_ground_truth_summarised.xlsx", index=False)
In [ ]:
# Run summarisation function on full Q&A tabular dataset (note that 'N' values have been removed, so only Q and A remain)
summarised_df_full = phi_summarise(qa_df, 'qa_text', 8)

# View results
display(summarised_df_full.head())
uid bank year quarter date section name title firm qa_type qa_num_within qa_num qa_text
0 JPMorganChase_3Q24_N_0.0 JPMorganChase 2024 3 2024-10-11 management_discussion Jeremy Barnum Chief Financial Officer JPMorgan Chase & Co. N NaN NaN Thank you and good morning, everyone. Starting...
1 JPMorganChase_3Q24_Q_1.0 JPMorganChase 2024 3 2024-10-11 questions_answers Jim Mitchell Analyst Seaport Global Securities LLC Q 1.0 1.0 Hey, good morning. So, Jeremy, as you highligh...
2 JPMorganChase_3Q24_A_1.0 JPMorganChase 2024 3 2024-10-11 questions_answers Jeremy Barnum Chief Financial Officer JPMorgan Chase & Co. A 1.0 1.0 Yeah. Sure, Jim. I'll try to answer both quest...
3 JPMorganChase_3Q24_Q_3.0 JPMorganChase 2024 3 2024-10-11 questions_answers Steven Chubak Analyst Wolfe Research LLC Q 3.0 3.0 Hi. Good morning. So Jeremy, how are you? So I...
4 JPMorganChase_3Q24_A_3.0 JPMorganChase 2024 3 2024-10-11 questions_answers Jeremy Barnum Chief Financial Officer JPMorgan Chase & Co. A 3.0 3.0 Sure. So good question and I agree with your n...
Processing batch 1/116
Processing batch 2/116
[... batches 3-115 ...]
Processing batch 116/116
Time taken for 926 rows: 58.79 minutes
Estimate for all transcripts (927): 0.98 hours
uid bank year quarter date section name title firm qa_type qa_num_within qa_num qa_text Summarised text summarised_text
0 JPMorganChase_3Q24_N_0.0 JPMorganChase 2024 3 2024-10-11 management_discussion Jeremy Barnum Chief Financial Officer JPMorgan Chase & Co. N NaN NaN Thank you and good morning, everyone. Starting... The Firm reported a net income of $12.9 billio...
1 JPMorganChase_3Q24_Q_1.0 JPMorganChase 2024 3 2024-10-11 questions_answers Jim Mitchell Analyst Seaport Global Securities LLC Q 1.0 1.0 Hey, good morning. So, Jeremy, as you highligh... The text discusses concerns about a significan...
2 JPMorganChase_3Q24_A_1.0 JPMorganChase 2024 3 2024-10-11 questions_answers Jeremy Barnum Chief Financial Officer JPMorgan Chase & Co. A 1.0 1.0 Yeah. Sure, Jim. I'll try to answer both quest... The yield curve is the primary factor driving ...
3 JPMorganChase_3Q24_Q_3.0 JPMorganChase 2024 3 2024-10-11 questions_answers Steven Chubak Analyst Wolfe Research LLC Q 3.0 3.0 Hi. Good morning. So Jeremy, how are you? So I... The consensus expense forecast for the next ye...
4 JPMorganChase_3Q24_A_3.0 JPMorganChase 2024 3 2024-10-11 questions_answers Jeremy Barnum Chief Financial Officer JPMorgan Chase & Co. A 3.0 3.0 Sure. So good question and I agree with your n... The company's financial performance is influen...
In [ ]:
# Save summarised file
summarised_df_full.to_excel(processed_data_folder + "/phi_fulltable_summarised.xlsx", index=False)

These summarised tables (ground truth and full) will be passed as input to FinBERT and the results compared with the analysis on the non-summarised table data.

2.2 Sentiment Analysis¶

We compare two sentiment classification models, "yiyanghkust/finbert-tone" and "soleimanian/financial-roberta-large-sentiment", both from Hugging Face. Both return 'positive', 'negative' or 'neutral', although the first model capitalises the first letter of each label.

Finbert-tone is a version of FinBERT fine-tuned for sentiment classification on 10,000 manually annotated sentences from analyst reports. The RoBERTa model, on the other hand, was trained on a large corpus including CSR reports, ESG news and earnings call transcripts.

We use the ground truth dataset to evaluate which model performs better.

2.2.0 Load the ground truth dataset¶

A new column is created in the evaluation dataset where the labels are converted to integers (positive to 1, negative to -1, neutral to 0).

In [ ]:
score_dict = {'neutral':0, 'positive':1, 'negative':-1}
eval_data = pd.read_excel(processed_data_folder + "/ground_truth_JPMorgan_manual.xlsx")
eval_data.drop(['qa_type', 'qa_num', 'true_topic', 'true_Q_outcome'], inplace=True, axis=1)
eval_data['true_score'] = eval_data['true_sentiment'].map(score_dict)

2.2.1 Class for running sentiment classification model¶

There are multiple ways to use the sentiment analysis models. The most straightforward is to get a label for each text and compare the accuracy, precision, recall and F1 score of the two models against the ground truth. This label is stored in a column with the suffix '_sentiment'.

Alternatively, we can use the label probabilities to compute a numeric score between -1 and 1. For each text, the formula is simply $$ \text{score} = \text{probability of positive} - \text{probability of negative.}$$ This is stored in a column with the suffix '_score'. By converting the true labels to -1, 0 and 1, we can then compare the MSE of the two models.

Another way of quantifying the output is to break each text into sentences and feed each sentence to the model. A score can then be computed from the number of positive and negative sentences: $$ \text{score} = \frac{\text{number of positive sentences} - \text{number of negative sentences}}{\text{number of positive sentences} + \text{number of negative sentences}}.$$ This is stored in a column with the suffix '_average_sentence_label_score', and we again compare the MSE of the two models.
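
As a concrete illustration of the two scoring formulas, here is a minimal sketch with made-up probabilities and labels (the helper names `prob_score` and `sentence_label_score` are ours for illustration, not part of the notebook's pipeline):

```python
def prob_score(response):
    # score = P(positive) - P(negative), from a top-k classification response
    probs = {item['label'].lower(): item['score'] for item in response}
    return probs.get('positive', 0.0) - probs.get('negative', 0.0)

def sentence_label_score(labels):
    # score = (#positive - #negative) / (#positive + #negative)
    pos, neg = labels.count('positive'), labels.count('negative')
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)

dummy_response = [{'label': 'Positive', 'score': 0.5},
                  {'label': 'Neutral', 'score': 0.25},
                  {'label': 'Negative', 'score': 0.25}]
print(prob_score(dummy_response))                                             # 0.25
print(sentence_label_score(['positive', 'neutral', 'negative', 'positive']))  # 1/3
```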

In [ ]:
## Helper functions
def get_quantiles(num_list, num_chunks):
  """
  Calculate approximate quantiles from a list of numbers, dividing the list into equal chunks

  Args:
    num_list (list of int or float): A list of numeric values to be split into approximately equal chunks, pre-sorted.
    num_chunks (int): Number of chunks the list to be split into.

  Returns:
    list of float or int: A list of values from 'num_list' that correspond to approximate quantiles.

  """
  quantiles = [(i + 1) / num_chunks for i in range(num_chunks - 1)]

  selected_positions = []
  for q in quantiles:
      if num_list:
          pos_index = int(len(num_list) * q) - 1
          if pos_index >= 0:
              selected_positions.append(num_list[min(pos_index, len(num_list) - 1)])
  return selected_positions


def chunk_text(text, num_chunks):
  """
  Split a given text into approximately equal chunks based on positions of periods.

  Args:
    text (str): The text to be split into chunks.
    num_chunks (int): The number of chunks to create.

  Returns:
    list of str: A list of text chunks.
  """
  positions = [index for index, char in enumerate(text) if char == '.']
  if positions:
      split_positions = get_quantiles(positions, num_chunks)

      last_position = 0
      chunks = []
      for pos in split_positions:
          chunks.append(text[last_position:pos + 1].strip())
          last_position = pos + 1
      chunks.append(text[last_position:].strip())
      return chunks
  # fall back to a single chunk if the text contains no periods
  return [text]


def compute_score_from_prob(response):
  """
  Compute a sentiment score based on probabilities for positive and negative labels of a response.
  The formula is probability of positive - probability of negative.

  Args:
    response (list of dict): A list of dictionaries where each dictionary contains a `label` and a `score`. It would work on any classification response from huggingface.

  Returns:
    float: The computed sentiment score.
  """
  positive_score = next(item['score'] for item in response if item['label'].lower() == 'positive')
  negative_score = next(item['score'] for item in response if item['label'].lower() == 'negative')
  return positive_score - negative_score


def split_string_by_punctuation(text):
  """
  Split a given text into sentences based on specific punctuation marks, while avoiding splitting on periods in abbreviations like i.e. and U.S..

  Args:
    text (str): The text to be split into sentences.

  Returns:
    list of str: A list of sentences.
  """
  punct_regex = r"(?=\S)(?:i\.e\.|J\.P\.|U\.S\.|ex\.|[A-Z][a-z]{0,3}\.|[^.?!]|\.(?!\s+[A-Z]))*.?"
  return re.findall(punct_regex, text)


def compute_label_from_score(score):
  if score > 0.2:
      return 'positive'
  elif score < -0.2:
      return 'negative'
  else:
      return 'neutral'
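
As a quick standalone check of split_string_by_punctuation, the snippet below restates the regex inline (with the abbreviation dots escaped, which we assume is the intended literal matching) and verifies that an abbreviation such as "U.S." does not trigger a split:

```python
import re

# same pattern as split_string_by_punctuation, with abbreviation dots escaped
punct_regex = r"(?=\S)(?:i\.e\.|J\.P\.|U\.S\.|ex\.|[A-Z][a-z]{0,3}\.|[^.?!]|\.(?!\s+[A-Z]))*.?"

text = "We operate in the U.S. market. Growth was strong."
sentences = [s for s in re.findall(punct_regex, text) if s]
print(sentences)
```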
In [ ]:
## Class
class Model:
  """
  Base class for using huggingface model

  Args:
    name (str): Custom name of the model to be used in dataframe heading.
    huggingface_model (str): The name of the model on huggingface hub, has a structure of {creator}/{model_name}
    task (str): The machine learning task the model performs (should be one of many from huggingface)
  """
  def __init__(self, name, huggingface_model, task):
      self.name = name
      self.huggingface_model = huggingface_model
      self.task = task
      self.model = pipeline(self.task, model=self.huggingface_model)
      self.tokenizer = AutoTokenizer.from_pretrained(self.huggingface_model)
      self.token_max_length = self.model.model.config.max_position_embeddings

  def get_model_response_for_df(self, data, column):
      pass


class ClassificationModel(Model):
    def __init__(self, name, huggingface_model):
        super().__init__(name, huggingface_model, "text-classification")

    def get_model_response_for_df(self, data, column):
        data[self.name + '_sentiment'], data[self.name + '_score'] = zip(
            *data.apply(lambda x: self.get_model_response(x, column), axis=1))
        return data

    def get_model_response(self, row, column):
        input = row[column]
        token_length = len(self.tokenizer(input)['input_ids'])
        if token_length <= self.token_max_length:
            response = self.model(input, top_k=3)
            label = response[0]['label'].lower()
            score = compute_score_from_prob(response)
            return label, score

        else:
            chunk_number = token_length // self.token_max_length + 1
            chunks = chunk_text(input, chunk_number)

            # Re-split any chunk that still exceeds the token limit, building a
            # new list rather than mutating `chunks` mid-iteration (which would
            # skip elements)
            resplit_chunks = []
            for chunk in chunks:
                chunk_tokens = len(self.tokenizer(chunk)['input_ids'])
                if chunk_tokens > self.token_max_length:
                    resplit_chunks.extend(chunk_text(chunk, chunk_tokens // self.token_max_length + 1))
                else:
                    resplit_chunks.append(chunk)
            chunks = resplit_chunks

            responses = self.model(chunks, top_k=3)
            scores = [compute_score_from_prob(response) for response in responses]
            score = sum(scores) / len(scores)
            label = compute_label_from_score(score)

            return label, score

    def get_model_response_from_sentences_df(self, data, column):
        data[self.name + '_average_sentence_label_score'] = data.apply(
            lambda x: self.get_model_response_from_sentences(x, column), axis=1)
        return data

    def get_model_response_from_sentences(self, row, column):
        input = row[column]
        sentences = split_string_by_punctuation(input)
        num_pos = 0
        num_neg = 0
        for sentence in sentences:
            response = self.model(sentence)[0]['label'].lower()
            if response == 'positive':
                num_pos += 1
            if response == 'negative':
                num_neg += 1

        if num_pos == 0 and num_neg == 0:
            return 0
        else:
            return (num_pos - num_neg) / (num_pos + num_neg)

2.2.2 Comparing the two models¶

In [ ]:
def get_model_accuracy(model_name, data):
  true_response = data['true_sentiment'].to_list()
  model_response = data[model_name+'_sentiment'].to_list()
  cf = ConfusionMatrixDisplay(confusion_matrix(true_response, model_response), display_labels=['negative', 'neutral', 'positive'])
  cf.plot()
  colorbar = cf.ax_.images[0].colorbar
  colorbar.set_label('Frequency')
  plt.title(f"Confusion matrix for {model_name}")
  plt.show()
  print(classification_report(true_response, model_response))
In [ ]:
# Load the two models and run them on the ground truth data set
finbert_model = ClassificationModel("finbert-tone", "yiyanghkust/finbert-tone")
eval_data = finbert_model.get_model_response_for_df(eval_data, 'qa_text')
roberta_model = ClassificationModel("financial-roberta-large", "soleimanian/financial-roberta-large-sentiment")
eval_data = roberta_model.get_model_response_for_df(eval_data, 'qa_text')
In [ ]:
# save the results of ground truth data set in the output folder
eval_data.to_csv(output_data_folder + "/sentiment_eval_result_ground_truth.csv", index=False)
In [ ]:
from IPython.display import display
from ipywidgets import Output, HBox

out1 = Output()
out2 = Output()

with out1:
  get_model_accuracy("finbert-tone", eval_data)
with out2:
  get_model_accuracy("financial-roberta-large", eval_data)
display(HBox([out1, out2]))

From the results, we can see that the RoBERTa model has a slightly higher accuracy of 67% compared to FinBERT's 65%. Although the two models have similar F1 scores for 'neutral', FinBERT's precision and recall are unbalanced, suggesting that it is quite conservative and predicts most labels as 'neutral'. This is also reflected in its recall for 'positive' and 'negative'.

For our analysis, we are more interested in the positive and negative classifications than the neutral ones, so the RoBERTa model is the better choice.

2.2.3 Running the model on the full data set¶

In [ ]:
# load the full dataset
full_data = pd.read_csv(processed_data_folder + "/transcripts_tabular_JPMorgan_clean.csv")
full_data = full_data[full_data['qa_type'].isin(['Q', 'A'])]
# running the model
roberta_model = ClassificationModel("financial-roberta-large", "soleimanian/financial-roberta-large-sentiment")
full_data = roberta_model.get_model_response_for_df(full_data, 'qa_text')
# save the response
full_data.to_csv(output_data_folder + "/sentiment_full_result.csv", index=False)

2.3 Topic modelling¶

2.3.1 FinBERT classification¶

2.3.1.0 Setting up the model¶

In [ ]:
# connecting to huggingface
huggingface_token = userdata.get("huggingface_token")
!huggingface-cli login --token $huggingface_token

# loading the tokeniser and the model
finbert_topic_tokeniser = AutoTokenizer.from_pretrained("nickmuchi/finbert-tone-finetuned-finance-topic-classification")
finbert_topic_model = AutoModelForSequenceClassification.from_pretrained("nickmuchi/finbert-tone-finetuned-finance-topic-classification")
In [ ]:
# define a label dictionary to help us interpret the labels predicted by the model
# (obtained from HuggingFace)
id2label= {
    0: "Analyst Update",
    1: "Fed | Central Banks",
    2: "Company | Product News",
    3: "Treasuries | Corporate Debt",
    4: "Dividend",
    5: "Earnings",
    6: "Energy | Oil",
    7: "Financials",
    8: "Currencies",
    9: "General News | Opinion",
    10: "Gold | Metals | Materials",
    11: "IPO",
    12: "Legal | Regulation",
    13: "M&A | Investments",
    14: "Macro",
    15: "Markets",
    16: "Politics",
    17: "Personnel Change",
    18: "Stock Commentary",
    19: "Stock Movement"
  }
In [ ]:
# Check token lengths
def plot_finbert_token_lengths(data_folder, finbert_tokeniser, max_length=512,
                               qa_only=True, summarised=False, appdx=""):
    """
    given the path to the data folder and the FinBERT tokeniser,
    this function returns an array of token lengths
    and plots their distribution

    max_length (int) : max token length - will be plotted!
    qa_only (bool) : keeps only non-nan qa_num if true
    """
    banks = ['JPMorgan']
    print("Which bank's data do you wish to label?")
    bank = input(banks)

    # load the data
    if summarised:
        summarised_text_path = os.path.join(data_folder, f"phi_fulltable{appdx}.xlsx")
        df = pd.read_excel(summarised_text_path)
        df = df.drop(['qa_text'], axis=1)
        df.rename(columns={'summarised_text': 'qa_text'}, inplace=True)
    else:
        transcript_csv_path = os.path.join(data_folder, f"transcripts_tabular_{bank}_clean.csv")
        df = pd.read_csv(transcript_csv_path)
    if qa_only:
      df = df[~df.qa_num.isna()].reset_index(drop=True)
    arr = df.qa_text

    token_lengths = []
    # estimate token length without truncation
    for txt in arr:
        inputs_nontruncated = finbert_tokeniser(
                                    txt,
                                    return_tensors="pt",
                                    truncation=False,
                                    padding=False
                                    )
        token_lengths.append(len(inputs_nontruncated[0]))

    fig, ax = plt.subplots(1,1, figsize=(3,3))
    sns.histplot(token_lengths, ax =ax)
    ax.vlines(max_length, 0, 140, ls='dashed', color='black')
    ax.set_xlabel("Token length")
    ax.set_ylabel("Number of texts")

    num_too_long = (np.array(token_lengths)>max_length).sum()
    print(f"Max token length is exceeded by {num_too_long} ({num_too_long*100/arr.shape[0]:.0f}%) entries")

plot_finbert_token_lengths(processed_data_folder, finbert_topic_tokeniser, max_length=512)
Which bank's data do you wish to label?
['JPMorgan']JPMorgan
Max token length is exceeded by 23 (3%) entries
[Figure: histogram of qa_text token lengths, with a dashed line marking the 512-token limit]

FinBERT can only handle inputs of up to 512 tokens. Truncating substantially longer texts could diminish the model's accuracy, so we begin by running only the tokenisation step to check how long the texts are.

Evidently, only 3% of entries exceed the maximum token length, so this is not a major problem. Still, we will include chunking at least as an option in our FinBERT topic-assignment function. Depending on classification accuracy, it might even be of interest to consider max lengths shorter than the model's 512-token limit.

2.3.1.1 Helper functions¶

2.3.1.1.1 Function for label prediction¶

We will define a function get_finbert_topics() that will take a string input (one question or answer) and generate:

  • predicted label ID
  • the corresponding label
  • logits for all possible topics

If chunking=True, texts longer than max_length will be split into equal-sized chunks and a label will be predicted for each chunk. If the chunks have different labels, all will be returned.

In [ ]:
# function for label prediction
def get_finbert_topics(txt, finbert_tokeniser, finbert_model, label_dict,
                       chunking=False, max_length=512):
    """
    given an input string, tokeniser, model, and a dictionary of labels,
    this function returns the predicted label id, predicted label, and the full
    array of logits for that input string

    chunking (bool): if True, texts longer than 512 tokens are split into the
            minimum number of ~equal-size chunks possible for each chunk to be
            smaller than 512 tokens. A label is then predicted for each chunk.
            If the labels do not match, the label gets assigned based on the max
            logit across chunks.
    """
    MAX_LENGTH = max_length

    if chunking:
        # estimate token length without truncation
        inputs_nontruncated = finbert_tokeniser(
                                    txt,
                                    return_tensors="pt",
                                    truncation=False,
                                    padding=False
                                    )
        token_length = len(inputs_nontruncated[0])
        if token_length>MAX_LENGTH:
            print(f"Token exceeds max length: {txt[:100]}")
            num_chunks = (token_length//MAX_LENGTH)+1
            txt_split = txt.split(" ")
            chunk_length_str=int(np.ceil(len(txt_split)/num_chunks))
            if len(txt_split)%num_chunks==1:  # odd len(txt_split)
              chunk_length_str += 1

            txt_list = [" ".join(txt_split[i:i + chunk_length_str]) for i in range(0, len(txt_split), chunk_length_str)]
        else:
            txt_list = [txt]
    else:
        txt_list = [txt]

    predicted_label_ids = []
    predicted_logits = []
    for i, txt_chunk in enumerate(txt_list):
        inputs = finbert_tokeniser(
            txt_chunk,
            return_tensors="pt",
            truncation=True,
            padding=True,
            max_length=MAX_LENGTH
            )

        with torch.no_grad():
            outputs = finbert_model(**inputs)
            logits = outputs.logits

        # get the predicted label
        predicted_label_ids.append(torch.argmax(logits, dim=1).item())
        predicted_logits.append(logits.detach().cpu().numpy().flatten())

    # check if predicted labels are the same for all chunks
    # if so, assign a single label to the input row
    if len(set(predicted_label_ids))==1:
        predicted_label = label_dict[predicted_label_ids[0]]
        return predicted_label_ids[0], predicted_label, predicted_logits[0], None
    else:
        # return all predictions, as well as the text chunks
        predicted_labels = [label_dict[predicted_label_ids[i]] for i in range(len(predicted_label_ids))]
        return predicted_label_ids, predicted_labels, predicted_logits, txt_list
2.3.1.1.2 Function for label storage¶

Next, we will define a function process_finbert that:

  • checks if a labelled dataset with the same chunking and max length already exists in the folder, else it will create a new file
  • runs get_finbert_topics() and handles multiple chunks
    • if multiple labels are returned, the first will replace the existing row and the others will be appended to the labelled dataset as new rows. The uid of such new rows will be the same as that of the existing row to allow one-sided merging of this and the original dataset.
  • allows for labelling of only the Q&A sections (qa_only=True) or the management discussion section as well (qa_only=False)
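
The uid-duplication scheme in the second bullet can be sketched with a toy pandas example (the frames and column values here are illustrative only): a one-sided merge repeats the original row's metadata for every chunk that shares its uid.

```python
import pandas as pd

# chunk-level labels: uid "t_2" was split into two chunks, so it appears twice
labels = pd.DataFrame({
    "uid": ["t_1", "t_2", "t_2"],
    "finbert_topic_label": ["Earnings", "Macro", "Financials"],
})
# original transcript metadata: one row per uid
meta = pd.DataFrame({
    "uid": ["t_1", "t_2"],
    "date": ["2024-10-11", "2024-10-11"],
})
merged = labels.merge(meta, on="uid", how="left")
print(len(merged))  # 3: metadata is repeated for each chunk of t_2
```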
In [ ]:
# Function for label storage
def process_finbert(finbert_tokeniser, finbert_model, save_folder, label_dict,
                    data_folder=None, chunking=False, max_length=512, qa_only=False,
                    summarised=False, synthetic=False, appdx=""):
    # define the bank of interest
    banks = ['JPMorgan']
    print("Which bank's data do you wish to label?")
    bank = input(banks)

    # define the columns of interest
    cols_to_add = ['finbert_topic_id', 'finbert_topic_label']
    cols_to_add += [f"topic_{i}_logit" for i in label_dict.keys()]

    # load (partly) labelled dataset (if it exists), else create the file
    finbert_topics_path = os.path.join(save_folder, f"finbert_topics_{bank}_chunking{chunking}_maxlength{max_length}{appdx}.csv")
    if os.path.exists(finbert_topics_path):
        print("Loading (partly) labelled data...")
        finbert_df = pd.read_csv(finbert_topics_path)
        transcripts_qa = finbert_df.copy()
    else:
        print("Creating a new dataset for labelled data...")
        if data_folder is None:
            raise ValueError("Data folder not supplied!")

        if summarised:
            summarised_text_path = os.path.join(data_folder, f"phi_fulltable{appdx}.xlsx")
            transcripts_all = pd.read_excel(summarised_text_path)
            transcripts_all = transcripts_all.drop(['qa_text'], axis=1)
            transcripts_all.rename(columns={'summarised_text': 'qa_text'}, inplace=True)
        elif synthetic:
            synthetic_text_path = os.path.join(data_folder, f"synthetic_data_for_finbert.csv")
            transcripts_all = pd.read_csv(synthetic_text_path)
        else:
            transcript_csv_path = os.path.join(data_folder, f"transcripts_tabular_{bank}_clean.csv")
            transcripts_all = pd.read_csv(transcript_csv_path)

        # subset only Q&A
        if qa_only:
            transcripts_qa = transcripts_all[~transcripts_all['qa_num'].isna()].copy().reset_index(drop=True)
        else:
            transcripts_qa = transcripts_all.copy()

        # keep only the strictly necessary subset of columns
        cols_to_keep = ['uid', 'qa_text']
        finbert_df = transcripts_qa[cols_to_keep].copy()

        # add cols for topic labels
        for col in cols_to_add:
            finbert_df.loc[:,col] = np.nan

        # save file
        finbert_df.to_csv(finbert_topics_path, index=False)

    # iterate over texts and save labels in each loop
    for i, row in transcripts_qa.iterrows():
        print(f"Processing text {i}/{transcripts_qa.shape[0]}...")
        predicted_label_id, predicted_label, logits, chunks = get_finbert_topics(
                                    txt=row['qa_text'],
                                    finbert_tokeniser=finbert_tokeniser,
                                    finbert_model=finbert_model,
                                    label_dict=label_dict,
                                    chunking=chunking,
                                    max_length=max_length
                                    )

        if chunks is not None:
            print(f"Multiple labels detected!")
            print(f"-------label: {predicted_label[0]}")
            finbert_df.loc[finbert_df['uid']==row['uid'], "qa_text"] = chunks[0]
            finbert_df.loc[finbert_df['uid']==row['uid'], cols_to_add[0]] = predicted_label_id[0]
            finbert_df.loc[finbert_df['uid']==row['uid'], cols_to_add[1]] = predicted_label[0]
            finbert_df.loc[finbert_df['uid']==row['uid'], cols_to_add[2:]] = logits[0]

            for k in range(1, len(chunks)):
                print(f"-------label: {predicted_label[k]}")
                row_to_add = pd.DataFrame({
                    "uid": [row['uid']],
                    "qa_text": [chunks[k]],
                    cols_to_add[0]: [predicted_label_id[k]],
                    cols_to_add[1]: [predicted_label[k]],
                    })
                row_to_add[cols_to_add[2:]] = logits[k]
                finbert_df = pd.concat((finbert_df, row_to_add), ignore_index=True)
        else:
            print(f"-------label: {predicted_label}")
            finbert_df.loc[finbert_df.uid==row.uid, cols_to_add[0]] = predicted_label_id
            finbert_df.loc[finbert_df.uid==row.uid, cols_to_add[1]] = predicted_label
            finbert_df.loc[finbert_df.uid==row.uid, cols_to_add[2:]] = logits

        # update the file
        finbert_df.to_csv(finbert_topics_path, index=False)
2.3.1.1.3 Function for model evaluation¶

We have a "ground truth" dataset where topics from the same topic list have been manually assigned to two transcripts' worth of randomly sampled question-answer pairs. This dataset can be used to evaluate model performance.

This is a multi-class problem and the datasets can be expected to be quite imbalanced. Therefore, we will examine precision and recall by printing a classification report. We will also generate two confusion matrices - one normalised by the column (predicted label class) and the other normalised by the row (true label). This will help us see how misclassifications arise.
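
The effect of the two normalisations can be sanity-checked on toy labels with sklearn's `normalize` argument (the label sequences below are invented): normalising by 'pred' makes each column sum to 1, while normalising by 'true' makes each row sum to 1.

```python
from sklearn.metrics import confusion_matrix

y_true = ["neutral", "neutral", "positive", "negative", "neutral", "positive"]
y_pred = ["neutral", "positive", "positive", "negative", "neutral", "neutral"]

by_pred = confusion_matrix(y_true, y_pred, normalize="pred")  # columns sum to 1
by_true = confusion_matrix(y_true, y_pred, normalize="true")  # rows sum to 1
print(by_pred.sum(axis=0), by_true.sum(axis=1))
```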

In [ ]:
def evaluate_model_performance(data_folder, finbert_folder, label_dict,
                               chunking=False, max_length=512, appdx="",
                               save_ground_truth=False):
    # define the bank of interest
    banks = ['JPMorgan']
    print("Which bank's data do you wish to label?")
    bank = input(banks)

    # load ground truth data
    path_ground_truth = os.path.join(data_folder, f"ground_truth_{bank}_manual.xlsx")
    df_ground_truth = pd.read_excel(path_ground_truth)

    # load model labels
    path_model = os.path.join(finbert_folder, f"finbert_topics_{bank}_chunking{chunking}_maxlength{max_length}{appdx}.csv")
    df_finbert = pd.read_csv(path_model)

    # merge the datasets on uid (multiple labels per uid possible)
    df_merged = df_finbert.merge(df_ground_truth, on=['uid'], how='left', suffixes=['_model', '_true'])

    # exclude rows with no ground truth
    df_merged = df_merged.dropna(axis=0, subset=['true_topic'])

    if save_ground_truth:
        logit_cols = [col for col in df_merged.columns if 'logit' in col]
        df_merged_short = df_merged.drop(['qa_text_model','qa_text_true',
                                          'true_sentiment', 'true_Q_outcome',
                                          'qa_num', 'finbert_topic_id', 'qa_type'] + logit_cols,
                                         axis=1)
        df_merged_short.to_csv(os.path.join(finbert_folder, f"finbert_topics_ground_truth_{bank}_chunking{chunking}_maxlength{max_length}_QA{appdx}.csv"),
                         index=False)

    print(f"There are {df_merged.shape[0]} texts in the evaluated dataset.")
    # get the topics covered in labelled and ground truth datasets
    topics = np.unique(np.concatenate((df_merged['true_topic'],df_merged['finbert_topic_label'])))

    # get classification report
    report = classification_report(
                  df_merged['true_topic'],
                  df_merged['finbert_topic_label'],
                  zero_division=0
                  )
    print(report)

    # plot confusion matrix
    fig, axes = plt.subplots(1,2, figsize=(8,3), sharey=True)
    for i, (norm, tlt) in enumerate(zip(['pred', 'true'], ['Normalised by predicted', 'Normalised by true'])):
        conf_matrix = confusion_matrix(df_merged['true_topic'],
                                       df_merged['finbert_topic_label'],
                                       normalize=norm)
        sns.heatmap(conf_matrix, ax=axes[i], cmap='crest')
        axes[i].set_xticklabels(topics, rotation=90)
        axes[i].set_xlabel("Predicted label")
        axes[i].set_title(tlt)
    axes[0].set_ylabel("True label")
    axes[0].set_yticklabels(topics, rotation=0)

    # plot confusion matrix - just one for presentation
    fig2, axes2 = plt.subplots(1,1, figsize=(9,8), sharey=True)
    conf_matrix = confusion_matrix(df_merged['true_topic'],
                                    df_merged['finbert_topic_label'],
                                    normalize='true')
    sns.heatmap(conf_matrix, ax=axes2, cmap='crest')
    fsize=19
    axes2.set_xticklabels(topics, rotation=90, fontsize=fsize)
    axes2.set_xlabel("Predicted label", fontsize=fsize)
    axes2.set_title("Confusion matrix", fontsize=fsize)
    axes2.set_ylabel("True label", fontsize=fsize)
    axes2.set_yticklabels(topics, rotation=0, fontsize=fsize)
    cbar = axes2.collections[0].colorbar
    cbar.ax.tick_params(labelsize=fsize)
    cbar.set_label("Fraction of true label", fontsize=fsize)
    plt.tight_layout()
2.3.1.1.4 Functions for plotting¶

2.3.1.2 Choosing the optimal chunking method¶

  • get_merged_data_for_plotting will allow the user to select the bank and the type of text (Q, A, Q+A, management discussion, or all) that is of interest. It will also merge labelled data with the full dataset, add a quarter_str column for easy selection of a specific quarter, and compute probabilities from logits for each text.

  • softmax(logits) will compute probabilities of topic occurrence based on the multi-class logits provided by the model.

  • get_optimal_subplot_dims(n) will compute the number of rows and columns needed in a set of subplots based on the total number of subplots.

  • get_stats_stars(pval, p_thresholds=[0.05,0.01,0.001]) will express the statistical significance of a p-value through 1-3 stars based on significance thresholds (* p<0.05, ** p<0.01, *** p<0.001 by default).

In [ ]:
def get_merged_data_for_plotting(finbert_folder, data_folder, label_dict,
                               chunking=True, max_length=512, bank='JPMorgan',
                                 datatype='all', appdx="", summarised=False,
                                 synthetic=False, sentiment=None, sentiment_folder=None):
    """
    merge and subset datasets before plotting

    datatype (str) can be 'Q' (questions only), 'A' (answers only),
                      'QA' (Q+A with qa_num only), 'presentation' (management discussion)
                      or 'all' (default; to proceed without subsetting )
    """
    # load model labels
    path_model = os.path.join(finbert_folder, f"finbert_topics_{bank}_chunking{chunking}_maxlength{max_length}{appdx}.csv")
    df_finbert = pd.read_csv(path_model)

    # load cleaned data
    if summarised:
        summarised_text_path = os.path.join(data_folder, f"phi_fulltable{appdx}.xlsx")
        transcripts_all = pd.read_excel(summarised_text_path)
        transcripts_all = transcripts_all.drop(['qa_text'], axis=1)
        transcripts_all.rename(columns={'summarised_text': 'qa_text'}, inplace=True)
    elif synthetic:
        synthetic_text_path = os.path.join(data_folder, "synthetic_data_for_finbert.csv")
        transcripts_all = pd.read_csv(synthetic_text_path)
    else:
        transcript_csv_path = os.path.join(data_folder, f"transcripts_tabular_{bank}_clean.csv")
        transcripts_all = pd.read_csv(transcript_csv_path)
    transcripts_all = transcripts_all.drop(['qa_text'], axis=1)

    # merge the datasets and drop unlabelled rows
    df_merged = df_finbert.merge(transcripts_all, on=['uid'], how='left')
    df_merged = df_merged.dropna(axis=0, subset=['finbert_topic_label'])

    # compute probabilities
    logit_cols = [f"topic_{i}_logit" for i in label_dict.keys()]
    prob_cols = [f"topic_{i}_prob" for i in label_dict.keys()]
    for pc in prob_cols:
        df_merged[pc] = np.nan
    df_merged[prob_cols] = softmax(df_merged[logit_cols].to_numpy(), axis=1)

    if not synthetic:
        # add quarters
        df_merged['quarter_str'] = [f"{q}Q{str(y)[-2:]}" for q, y in zip(df_merged['quarter'], df_merged['year'])]

        # order quarters
        quarter_str_order = df_merged['quarter_str'].unique()[::-1]
        df_merged['quarter_str'] = pd.Categorical(df_merged['quarter_str'], categories=quarter_str_order, ordered=True)
        # print("QUARTER ORDER: ", quarter_str_order)
    else:
        quarter_str_order = None

    # subsetting data type
    if datatype in ["Q", "A"]:
        df_merged = df_merged[df_merged['qa_type']==datatype].copy().reset_index(drop=True)
    elif datatype == "QA":
        df_merged = df_merged[~df_merged['qa_num'].isna()].copy().reset_index(drop=True)
    elif datatype == "presentation":
        df_merged = df_merged[df_merged['section']=='management_discussion'].copy().reset_index(drop=True)

    if sentiment is not None and sentiment_folder is not None:
        sentiment_path = os.path.join(sentiment_folder, "full_result.csv")
        sentiment_df = pd.read_csv(sentiment_path)
        sentiment_colname = 'financial-roberta-large_sentiment'
        sentiment_df = sentiment_df[['uid', sentiment_colname]]
        df_merged_sentiment = pd.merge(df_merged, sentiment_df, on=['uid'], how='left')
        if sentiment=='negative minus positive':
            pass
        else:
            df_merged = df_merged_sentiment[df_merged_sentiment[sentiment_colname].eq(sentiment)].copy().reset_index()

    return df_merged, quarter_str_order, (bank, datatype)

# derive probabilities from logits
def softmax(logits, axis=1):
    # subtract the row-wise max for numerical stability before exponentiating
    exp_logits = np.exp(logits - np.nanmax(logits, axis=axis, keepdims=True))
    return exp_logits / exp_logits.sum(axis=axis, keepdims=True)

def get_optimal_subplot_dims(n):
    rows = math.floor(math.sqrt(n))
    cols = math.ceil(n / rows)

    if rows * cols < n:
        rows += 1

    return rows, cols

def get_stats_stars(pval, p_thresholds=[0.05,0.01,0.001]):
    p_thresholds = np.asarray(p_thresholds)
    ptext = "*" * (pval<p_thresholds).sum()
    if (pval<p_thresholds).sum()==0:
        ptext="n.s."

    return ptext
2.3.1.2.1 chunking=False, maxlength=512¶
In [ ]:
# generate predictions
process_finbert(
        finbert_tokeniser=finbert_topic_tokeniser,
        finbert_model=finbert_topic_model,
        save_folder=output_data_folder,
        label_dict=id2label,
        data_folder=output_data_folder,
        chunking=False,
        max_length=512,
        qa_only=True
        )
In [ ]:
# evaluate model performance
evaluate_model_performance(
                          data_folder=processed_data_folder,
                          finbert_folder=output_data_folder,
                          label_dict=id2label,
                          chunking=False,
                          max_length=512,
                          save_ground_truth=True
                          )
Which bank's data do you wish to label?
['JPMorgan']JPMorgan
There are 108 texts in the evaluated dataset.
                             precision    recall  f1-score   support

     Company | Product News       0.25      0.06      0.10        16
                   Earnings       0.00      0.00      0.00        24
        Fed | Central Banks       0.18      0.33      0.24         6
                 Financials       0.00      0.00      0.00        28
     General News | Opinion       0.11      0.90      0.19        10
         Legal | Regulation       0.00      0.00      0.00         7
          M&A | Investments       0.00      0.00      0.00         7
                      Macro       0.00      0.00      0.00         3
                    Markets       0.00      0.00      0.00         5
                   Politics       0.00      0.00      0.00         0
           Stock Commentary       0.00      0.00      0.00         0
Treasuries | Corporate Debt       0.50      0.50      0.50         2

                   accuracy                           0.12       108
                  macro avg       0.09      0.15      0.09       108
               weighted avg       0.07      0.12      0.05       108

[Figures: confusion matrices, normalised by predicted label and by true label]

Evidently, model accuracy is above chance level (5%).

Still, correct classification of just 12% of inputs cannot be considered a good performance.
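The 5% chance level corresponds to a classifier that picks one topic uniformly at random; a quick sketch (the 20-topic label-set size is an assumption implied by the 5% figure, since only 12 topics appear in this ground-truth sample):

```python
# Uniform-chance baseline for topic classification.
# Assumption (implied by the 5% figure above): the FinBERT topic label set
# contains 20 topics, of which only 12 appear in this ground-truth sample.
n_topics = 20
chance_accuracy = 1 / n_topics
model_accuracy = 0.12  # accuracy from the classification report above

print(f"chance: {chance_accuracy:.0%}, model: {model_accuracy:.0%}")  # chance: 5%, model: 12%
```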

Examining finer-grained performance metrics:

  • General News | Opinion has very high recall (0.90) but fairly low precision (0.11), indicating that the model is prone to incorrectly applying this rather uninformative label to texts from almost all other topics (see the General News | Opinion column in the right heatmap).
    • Politics and Stock Commentary do not seem affected by this misclassification only because there are no instances of these labels in the ground truth dataset.
    • Texts belonging to Fed | Central Banks appear to be the least prone to being misclassified as General News | Opinion.
  • Fed | Central Banks and Treasuries | Corporate Debt also have higher recall than most topics, and non-zero precision.
  • All instances that the model classified as M&A | Investments, Macro, and Stock Commentary are labelled as Earnings in the ground truth dataset. In fact, no instance of Earnings in the ground truth dataset is labelled as such by the model.
    • This potentially reflects a different interpretation of the Earnings topic by the financial experts who labelled the data FinBERT was fine-tuned on and the human labeller who created the "ground truth" dataset used here, who has no background in finance. In the current "ground truth" dataset, texts about net interest income (NII) in particular tended to be classified as Earnings.
  • The model also has not classified any of these texts as Legal | Regulation (7 instances in the ground truth dataset) or Markets (5 instances).
  • The ground truth labeller primarily struggled to discriminate between Financials and Earnings; Legal | Regulation and Fed | Central Banks; Product | Company News and M&A | Investment. It is interesting that, rather than confusing specific pairs of classes, the model seems to have a tendency to label texts as General News | Opinion or (less often) Fed | Central Banks.

To assess model performance beyond its tendency to classify texts as "General News | Opinion", we will now define a function that ignores the probability of "General News | Opinion" and assigns labels based on the probabilities of the remaining topics.

2.3.1.2.2 Defining a function to reclassify texts ignoring "General News | Opinion"¶

Preventing any texts from being classified as "General News | Opinion" is not ideal, because in some cases it really is the most suitable topic. Based on the randomly sampled ground truth dataset, we could expect about 9% of the texts to truly reflect general news or opinion. Unfortunately, logit values do not allow us to separate this minority of cases from the many others where this topic is assigned erroneously. For example,

  • All right. Thanks for all that. -- would fit under "General News | Opinion" and has a logit value for this topic of 3.77
  • Very good. Appreciate the color and candor, as always. Thank you. -- would also fit under "General News | Opinion", but has a logit value of 5.70
  • If we adjust [the consensus expense forecast] for the one-timers this year, that would suggest a core expense base that's just below $90 billion, so pretty healthy step-up in expenses. I know you've always had a strong commitment and discipline around investment. Just want to better understand where those incremental dollars are being deployed and just which investments are being prioritized in particular looking out to next year. -- This is about investments, yet its logit value (4.26) is between the two more general texts.

Therefore, completely removing this topic label is the most pragmatic approach.

In [ ]:
def reclassify_finbert(finbert_folder, label_dict, exclude='General News | Opinion',
                       chunking=False, max_length=512, appdx=""):
    # define the bank of interest
    banks = ['JPMorgan']
    print("Which bank's data do you wish to label?")
    bank = input(banks)

    # load model labels
    path_model = os.path.join(finbert_folder, f"finbert_topics_{bank}_chunking{chunking}_maxlength{max_length}{appdx}.csv")
    path_model_new = os.path.join(finbert_folder, f"finbert_topics_{bank}_chunking{chunking}_maxlength{max_length}{appdx}_relabelled.csv")
    df_finbert = pd.read_csv(path_model)

    # find the id of the label to exclude
    key_to_exclude = [key for key, val in label_dict.items() if val == exclude][0]

    # isolate logit cols, except the key to exclude
    logit_cols = [col for col in df_finbert.columns if "_logit" in col and f"{key_to_exclude}_logit" not in col]

    df_finbert['finbert_topic_id'] = [int(i.split("_")[1]) for i in df_finbert[logit_cols].idxmax(axis=1)]
    df_finbert['finbert_topic_label'] = [label_dict[i] for i in df_finbert['finbert_topic_id']]

    df_finbert.to_csv(path_model_new, index=False)
In [ ]:
# reclassify the model
reclassify_finbert(
                  finbert_folder=output_data_folder,
                  label_dict=id2label,
                  chunking=False,
                  max_length=512,
                  exclude='General News | Opinion'
                  )

evaluate_model_performance(
                          data_folder=processed_data_folder,
                          finbert_folder=output_data_folder,
                          label_dict=id2label,
                          chunking=False,
                          max_length=512,
                          appdx='_relabelled'
                          )
Which bank's data do you wish to label?
['JPMorgan']JPMorgan
Which bank's data do you wish to label?
['JPMorgan']JPMorgan
There are 108 texts in the evaluated dataset.
                             precision    recall  f1-score   support

     Company | Product News       0.52      1.00      0.68        16
                   Earnings       0.00      0.00      0.00        24
        Fed | Central Banks       0.12      0.33      0.18         6
                 Financials       0.17      0.04      0.06        28
     General News | Opinion       0.00      0.00      0.00        10
         Legal | Regulation       0.00      0.00      0.00         7
          M&A | Investments       0.00      0.00      0.00         7
                      Macro       0.04      0.33      0.07         3
                    Markets       0.00      0.00      0.00         5
                   Politics       0.00      0.00      0.00         0
           Stock Commentary       0.00      0.00      0.00         0
Treasuries | Corporate Debt       0.25      0.50      0.33         2

                   accuracy                           0.19       108
                  macro avg       0.09      0.18      0.11       108
               weighted avg       0.13      0.19      0.13       108

[Figure: confusion matrix]

If texts are relabelled based on the max topic probability, ignoring "General News | Opinion", model accuracy increases from 12% to 19%.

The confusions are also more in line with the uncertainties of the human labeller:

  • "Earnings" misclassified as "Macro", "Financials", "Stock Commentary"
  • "Legal | Regulation" misclassified as "Fed | Central Banks"
  • "Fed | Central Banks" misclassified as "Politics"
  • "M&A | Investments" misclassified as "Company | Product News" (also as "Stock Commentary")
2.3.1.2.3 Defining a function to plot model logits for ground truth labels¶

A better way to visualise the behaviour of the model would be by looking beyond the assigned labels and plotting the logit density distributions of various ground truth labels directly.

The following function will take the subset of data with a particular ground truth label and plot the density distributions of model logits. Kernel density curves with peaks above 0 (chance level logit) will be highlighted and included in the legend of the plot.

In [ ]:
def plot_model_logits(true_topic, data_folder, finbert_folder, label_dict,
                               chunking=False, max_length=512, appdx="", bank='JPMorgan',
                      save_topic=None):

    # load ground truth data
    path_ground_truth = os.path.join(data_folder, f"ground_truth_{bank}.csv")
    df_ground_truth = pd.read_csv(path_ground_truth)

    # load model labels
    path_model = os.path.join(finbert_folder, f"finbert_topics_{bank}_chunking{chunking}_maxlength{max_length}{appdx}.csv")
    df_finbert = pd.read_csv(path_model)

    # merge the datasets on uid (multiple labels per uid possible)
    df_merged = df_finbert.merge(df_ground_truth, on=['uid'], how='left', suffixes=['_model', '_true'])

    # exclude rows with no ground truth
    df_merged = df_merged.dropna(axis=0, subset=['true_topic'])


    logit_cols = [f"topic_{i}_logit" for i in label_dict.keys()]

    df_logits = df_merged[df_merged['true_topic']==true_topic][logit_cols].copy()
    cmap = sns.color_palette("tab20")

    x_range = np.linspace(-10, 10, 1000)
    fig,ax = plt.subplots(1,1, figsize=(4,2))
    for col in df_logits.columns:
        i = int(col.split("_")[1])  # get topic label id

        kde = gaussian_kde(df_logits.loc[:,col].values, bw_method="scott")
        kde_values = kde(x_range)
        kde_peak = x_range[np.argmax(kde_values)]

        if kde_peak<=0:
            sns.kdeplot(df_logits.loc[:,col].values, color='grey', alpha=0.2, lw=1)
        else:
            sns.kdeplot(df_logits.loc[:,col].values, color=cmap[i], alpha=1, lw=2, label=label_dict[i])
    fsize=12
    ax.set_title(f"True label: {true_topic}", fontsize=fsize)
    ax.set_xlabel("FinBERT logit", fontsize=fsize)
    ax.set_ylabel("Density", fontsize=fsize)
    plt.tick_params(axis='both', which='major', labelsize=fsize)
    plt.legend(loc='upper left', bbox_to_anchor=(1.02, 1), prop={'size': fsize})
In [ ]:
for topic in ['Company | Product News', 'Earnings', 'Fed | Central Banks',
               'Financials', 'Legal | Regulation', 'M&A | Investments']:
    plot_model_logits(true_topic=topic,
                      data_folder=processed_data_folder,
                      finbert_folder=output_data_folder,
                      label_dict=id2label,
                      chunking=False,
                      max_length=512,
                      bank='JPMorgan'
                      )
[Figures: FinBERT logit density distributions for each of the six ground truth topics]

These plots confirm that texts manually labelled as, for example, "M&A | Investments" often get classified as "Company | Product News", which is a pair of categories our human labeller often struggled to discriminate as well. Similarly, texts manually labelled as "Legal | Regulation" (which the model seems to rarely use for Q&A texts) often get classified as "Fed | Central Banks".

2.3.1.2.4 chunking=True, maxlength=512¶

With these arguments, longer texts (3% of the data) will be split into equal-sized chunks, so all of the text is considered for classification. Some uids will no longer be unique.
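The chunking step described above can be sketched as follows (`chunk_tokens` is a hypothetical helper for illustration; the actual process_finbert implementation may differ):

```python
import math

def chunk_tokens(token_ids, max_length=512):
    """Split a token-id sequence into equal-sized chunks, each within max_length.

    Minimal sketch of the 'equal-sized chunks' idea: rather than filling chunks
    to max_length and leaving a short remainder, all chunks get a similar size.
    """
    if len(token_ids) <= max_length:
        return [token_ids]
    n_chunks = math.ceil(len(token_ids) / max_length)
    chunk_size = math.ceil(len(token_ids) / n_chunks)
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]

chunks = chunk_tokens(list(range(900)), max_length=512)
print([len(c) for c in chunks])  # [450, 450]
```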

In [ ]:
process_finbert(
        finbert_tokeniser=finbert_topic_tokeniser,
        finbert_model=finbert_topic_model,
        save_folder=output_data_folder,
        label_dict=id2label,
        data_folder=processed_data_folder,
        chunking=True,
        max_length=512,
        qa_only=False
        )
In [ ]:
evaluate_model_performance(
                          data_folder=processed_data_folder,
                          finbert_folder=output_data_folder,
                          label_dict=id2label,
                          chunking=True,
                          max_length=512
                          )
Which bank's data do you wish to label?
['JPMorgan']JPMorgan
There are 108 texts in the evaluated dataset.
                             precision    recall  f1-score   support

     Company | Product News       0.25      0.06      0.10        16
                   Earnings       0.00      0.00      0.00        24
        Fed | Central Banks       0.18      0.33      0.24         6
                 Financials       0.00      0.00      0.00        28
     General News | Opinion       0.11      0.90      0.19        10
         Legal | Regulation       0.00      0.00      0.00         7
          M&A | Investments       0.00      0.00      0.00         7
                      Macro       0.00      0.00      0.00         3
                    Markets       0.00      0.00      0.00         5
                   Politics       0.00      0.00      0.00         0
           Stock Commentary       0.00      0.00      0.00         0
Treasuries | Corporate Debt       0.50      0.50      0.50         2

                   accuracy                           0.12       108
                  macro avg       0.09      0.15      0.09       108
               weighted avg       0.07      0.12      0.05       108

[Figure: confusion matrix]

Since the evaluated dataset still contains 108 texts, it is either that none of the chunked texts are part of the ground truth dataset or that all chunked texts in the ground truth dataset ended up with the same label. Thus, it is unsurprising that the evaluation results are exactly the same as before.

We will relabel the output of this model for later use, but will not re-evaluate the relabelled outcome, since the above result already shows that there will be no difference to the previous re-evaluation (2.3.1.2.2).
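Where a uid maps to several chunks, the chunk-level logits can be folded back into one prediction per uid, for example by averaging (a sketch with hypothetical values; the column names follow the output files, but this is not necessarily how process_finbert handles chunks):

```python
import pandas as pd

# Hypothetical chunk-level output: two chunks share uid "a"
df_chunks = pd.DataFrame({
    "uid": ["a", "a", "b"],
    "topic_0_logit": [1.0, 3.0, 0.5],
    "topic_1_logit": [2.0, 0.0, 4.0],
})

# Average logits across chunks of the same uid, then relabel by argmax
df_agg = df_chunks.groupby("uid", as_index=False).mean(numeric_only=True)
logit_cols = ["topic_0_logit", "topic_1_logit"]
df_agg["finbert_topic_id"] = df_agg[logit_cols].to_numpy().argmax(axis=1)
print(df_agg)
```

Averaging logits before taking the argmax avoids giving any single chunk a winner-takes-all say over the whole text.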

In [ ]:
# reclassify the model
reclassify_finbert(
                  finbert_folder=output_data_folder,
                  label_dict=id2label,
                  chunking=True,
                  max_length=512,
                  exclude='General News | Opinion'
                  )
Which bank's data do you wish to label?
['JPMorgan']JPMorgan
2.3.1.2.5 chunking=True, maxlength=128¶

We will also try shorter chunks to see if they help reduce the prevalence of the General News | Opinion label and improve model accuracy.

First, let us check how many texts will now get chunked.

In [ ]:
plot_finbert_token_lengths(processed_data_folder, finbert_topic_tokeniser, max_length=128)
Which bank's data do you wish to label?
['JPMorgan']JPMorgan
Max token length is exceeded by 368 (41%) entries
[Figure: distribution of token lengths]

Now, we will generate the labels.

In [ ]:
process_finbert(
        finbert_tokeniser=finbert_topic_tokeniser,
        finbert_model=finbert_topic_model,
        save_folder=output_data_folder,
        label_dict=id2label,
        data_folder=processed_data_folder,
        chunking=True,
        max_length=128,
        qa_only=False
        )

Finally, we can evaluate the model's performance on this chunked dataset.

In [ ]:
evaluate_model_performance(
                          data_folder=processed_data_folder,
                          finbert_folder=output_data_folder,
                          label_dict=id2label,
                          chunking=True,
                          max_length=128,
                          )
Which bank's data do you wish to label?
['JPMorgan']JPMorgan
There are 128 texts in the evaluated dataset.
                             precision    recall  f1-score   support

     Company | Product News       0.33      0.12      0.17        17
                   Earnings       0.00      0.00      0.00        26
        Fed | Central Banks       0.12      0.29      0.17         7
                 Financials       0.00      0.00      0.00        39
     General News | Opinion       0.11      0.83      0.20        12
         Legal | Regulation       0.00      0.00      0.00        10
          M&A | Investments       0.00      0.00      0.00         7
                      Macro       0.00      0.00      0.00         3
                    Markets       0.00      0.00      0.00         5
                   Politics       0.00      0.00      0.00         0
           Stock Commentary       0.00      0.00      0.00         0
Treasuries | Corporate Debt       0.50      0.50      0.50         2

                   accuracy                           0.12       128
                  macro avg       0.09      0.14      0.09       128
               weighted avg       0.07      0.12      0.06       128

[Figures: confusion matrices, normalised by predicted label and by true label]

Model accuracy has not improved.

Many texts still get misclassified as General News | Opinion. We can check model performance if this topic is ignored.

In [ ]:
# reclassify the model
reclassify_finbert(
                  finbert_folder=output_data_folder,
                  label_dict=id2label,
                  chunking=True,
                  max_length=128,
                  exclude='General News | Opinion'
                  )

evaluate_model_performance(
                          data_folder=processed_data_folder,
                          finbert_folder=output_data_folder,
                          label_dict=id2label,
                          chunking=True,
                          max_length=128,
                          appdx='_relabelled'
                          )
Which bank's data do you wish to label?
['JPMorgan']JPMorgan
Which bank's data do you wish to label?
['JPMorgan']JPMorgan
There are 128 texts in the evaluated dataset.
                             precision    recall  f1-score   support

     Company | Product News       0.41      0.94      0.57        17
                   Earnings       0.00      0.00      0.00        26
        Fed | Central Banks       0.10      0.29      0.14         7
                 Financials       0.14      0.03      0.04        39
     General News | Opinion       0.00      0.00      0.00        12
         Legal | Regulation       0.00      0.00      0.00        10
          M&A | Investments       0.00      0.00      0.00         7
                      Macro       0.03      0.33      0.06         3
                    Markets       0.00      0.00      0.00         5
                   Politics       0.00      0.00      0.00         0
           Stock Commentary       0.00      0.00      0.00         0
Treasuries | Corporate Debt       0.20      0.50      0.29         2

                   accuracy                           0.16       128
                  macro avg       0.07      0.17      0.09       128
               weighted avg       0.11      0.16      0.10       128

[Figure: confusion matrix]

Ignoring General News | Opinion, the model reaches 16% accuracy: better than the 12% achieved with that label included, but slightly worse than the 19% reached with maxlength=512.

It is also worth checking the model logit distributions for the most frequent ground truth labels.

In [ ]:
for topic in ['Company | Product News', 'Earnings', 'Fed | Central Banks',
               'Financials', 'Legal | Regulation', 'M&A | Investments']:
    plot_model_logits(true_topic=topic,
                      data_folder=processed_data_folder,
                      finbert_folder=output_data_folder,
                      label_dict=id2label,
                      chunking=True,
                      max_length=128,
                      bank='JPMorgan'
                      )
[Figures: FinBERT logit density distributions for each of the six ground truth topics]

As before, topics with the best logits (KDE peak above the chance level logit of 0) seem to be in good conceptual alignment with ground truth labels. Similarities between some KDE curves also illustrate the potential downsides of a winner-takes-all classification.

Therefore, all further analyses will be based on model logits, not discrete labels.
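The downside of winner-takes-all labelling can be illustrated with a small sketch (hypothetical probabilities for three texts where two topics are near-tied):

```python
import numpy as np
import pandas as pd

# Hypothetical topic probabilities: the two topics are almost indistinguishable
df = pd.DataFrame({
    "topic_0_prob": [0.51, 0.49, 0.50],
    "topic_1_prob": [0.49, 0.51, 0.50],
})

# Winner-takes-all: tiny probability differences flip the discrete label entirely
hard_share_topic_0 = (df.to_numpy().argmax(axis=1) == 0).mean()

# Soft aggregation: averaged probabilities retain the underlying uncertainty
soft_share_topic_0 = df["topic_0_prob"].mean()

print(hard_share_topic_0, soft_share_topic_0)
```

Hard labels report topic 0 for two of the three texts, while the averaged probabilities correctly show the two topics as equally likely overall.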

2.3.1.3 Using summarised texts¶

A reason for texts being misclassified as "General News | Opinion" might be the conversational language used in the Q&A sections of the calls. To see if a change in tone mitigates this flaw, we will apply the model to texts summarised using Phi-3.5.

In [ ]:
# checking token lengths of summarised texts
plot_finbert_token_lengths(processed_data_folder, finbert_topic_tokeniser, max_length=512, summarised=True,
                           appdx="_summarised")
Which bank's data do you wish to label?
['JPMorgan']JPMorgan
Max token length is exceeded by 0 (0%) entries
[Figure: distribution of token lengths for summarised texts]

As required by the summarisation prompt, all texts are now short, so chunking will not be necessary.

In [ ]:
# topic label prediction on summarised texts
process_finbert(
        finbert_tokeniser=finbert_topic_tokeniser,
        finbert_model=finbert_topic_model,
        save_folder=output_data_folder,
        label_dict=id2label,
        data_folder=processed_data_folder,
        chunking=False,
        max_length=512,
        qa_only=False,
        summarised=True,
        appdx="_summarised"
        )
In [ ]:
# evaluating model performance on summarised texts
evaluate_model_performance(
                          data_folder=processed_data_folder,
                          finbert_folder=output_data_folder,
                          label_dict=id2label,
                          chunking=False,
                          max_length=512,
                          appdx='_summarised',
                          save_ground_truth=True
                          )
Which bank's data do you wish to label?
['JPMorgan']JPMorgan
There are 108 texts in the evaluated dataset.
                             precision    recall  f1-score   support

     Company | Product News       0.38      0.88      0.53        16
                   Earnings       0.00      0.00      0.00        24
        Fed | Central Banks       0.16      0.50      0.24         6
                 Financials       0.36      0.14      0.21        28
     General News | Opinion       0.26      0.50      0.34        10
         Legal | Regulation       0.00      0.00      0.00         7
          M&A | Investments       0.00      0.00      0.00         7
                      Macro       0.00      0.00      0.00         3
                    Markets       1.00      0.20      0.33         5
Treasuries | Corporate Debt       0.50      0.50      0.50         2

                   accuracy                           0.26       108
                  macro avg       0.27      0.27      0.22       108
               weighted avg       0.24      0.26      0.20       108

No description has been provided for this image
No description has been provided for this image

Accuracy has increased to 26%! The model has become less prone to labelling texts as "General News | Opinion". Interestingly, many of its misclassifications now fall into the "Company | Product News" category.

Let us check if relabelling based on model logits, while ignoring "General News | Opinion", still improves accuracy.
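The relabelling step amounts to picking the next-best label once the excluded one is removed from consideration. A minimal illustration with made-up probabilities (the actual `reclassify_finbert` operates on the saved FinBERT logits):

```python
def relabel_excluding(probs, labels, exclude):
    """Pick the highest-probability label, ignoring the excluded one."""
    return max((p, l) for p, l in zip(probs, labels) if l != exclude)[1]

labels = ["Earnings", "Financials", "General News | Opinion"]
probs = [0.30, 0.25, 0.45]  # made-up probabilities
print(relabel_excluding(probs, labels, "General News | Opinion"))  # Earnings
```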

InΒ [Β ]:
# reclassify the model
reclassify_finbert(
                  finbert_folder=output_data_folder,
                  label_dict=id2label,
                  chunking=False,
                  max_length=512,
                  exclude='General News | Opinion',
                  appdx='_summarised'
                  )

evaluate_model_performance(
                          data_folder=processed_data_folder,
                          finbert_folder=output_data_folder,
                          label_dict=id2label,
                          chunking=False,
                          max_length=512,
                          appdx='_summarised_relabelled'
                          )
Which bank's data do you wish to label?
['JPMorgan']JPMorgan
Which bank's data do you wish to label?
['JPMorgan']JPMorgan
There are 108 texts in the evaluated dataset.
                             precision    recall  f1-score   support

     Company | Product News       0.32      0.94      0.48        16
                   Earnings       0.00      0.00      0.00        24
        Fed | Central Banks       0.14      0.50      0.22         6
                 Financials       0.36      0.14      0.21        28
     General News | Opinion       0.00      0.00      0.00        10
         Legal | Regulation       0.00      0.00      0.00         7
          M&A | Investments       0.00      0.00      0.00         7
                      Macro       0.00      0.00      0.00         3
                    Markets       1.00      0.20      0.33         5
Treasuries | Corporate Debt       0.33      0.50      0.40         2

                   accuracy                           0.22       108
                  macro avg       0.22      0.23      0.16       108
               weighted avg       0.20      0.22      0.16       108

No description has been provided for this image
No description has been provided for this image

Accuracy dropped to 22%. This makes sense: on summarised texts, the model is already less prone to labelling texts as "General News | Opinion"; instead, texts are now most often misclassified as "Company | Product News". We will therefore check how accuracy changes if that topic is excluded instead.

InΒ [Β ]:
# reclassify the model
reclassify_finbert(
                  finbert_folder=output_data_folder,
                  label_dict=id2label,
                  chunking=False,
                  max_length=512,
                  exclude='Company | Product News',
                  appdx='_summarised'
                  )

evaluate_model_performance(
                          data_folder=processed_data_folder,
                          finbert_folder=output_data_folder,
                          label_dict=id2label,
                          chunking=False,
                          max_length=512,
                          appdx='_summarised_relabelled'
                          )
Which bank's data do you wish to label?
['JPMorgan']JPMorgan
Which bank's data do you wish to label?
['JPMorgan']JPMorgan
There are 108 texts in the evaluated dataset.
                             precision    recall  f1-score   support

     Company | Product News       0.00      0.00      0.00        16
                   Earnings       0.00      0.00      0.00        24
        Fed | Central Banks       0.16      0.50      0.24         6
                 Financials       0.29      0.18      0.22        28
     General News | Opinion       0.17      0.70      0.27        10
         Legal | Regulation       0.00      0.00      0.00         7
          M&A | Investments       0.00      0.00      0.00         7
                      Macro       0.05      0.33      0.08         3
                    Markets       1.00      0.20      0.33         5
                   Politics       0.00      0.00      0.00         0
Treasuries | Corporate Debt       0.33      0.50      0.40         2

                   accuracy                           0.17       108
                  macro avg       0.18      0.22      0.14       108
               weighted avg       0.15      0.17      0.12       108

No description has been provided for this image
No description has been provided for this image

Ignoring "Company | Product News", yielded 17% accuracy. The inadequacy of such blanket removal is not surprising given that this topic is genuinely quite prevalent in the dataset.

Overall, it is a good sign that ignoring whole labels no longer helps improve the accuracy.

As a last check, let us see the probability distributions of the most common topics.

InΒ [Β ]:
for topic in ['Company | Product News', 'Earnings', 'Fed | Central Banks',
               'Financials', 'Legal | Regulation', 'M&A | Investments']:
    plot_model_logits(true_topic=topic,
                      data_folder=processed_data_folder,
                      finbert_folder=output_data_folder,
                      label_dict=id2label,
                      chunking=False,
                      max_length=512,
                      bank='JPMorgan',
                      appdx='_summarised',
                      save_topic='Legal | Regulation'
                      )
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

As before, topic probability distributions illustrate that several topics can be associated with texts of a particular "true label", and the margins for misclassification are likely small. Thus, the use of probability distributions, rather than discrete labels, will provide a more complete view of the information contained in the texts.


Overall, we will proceed by using:

  • FinBERT output on summarised texts to analyse the Q&A portions of the transcripts because of the better performance;
  • FinBERT output on raw texts (max_length=512) to analyse the presentation portions of the transcripts, because executives' summaries cover a wide range of topics and a two-sentence summary of several pages of text does not capture this topic diversity adequately.

In both cases, we will use topic probability distributions, not discrete labels.
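To make the distinction concrete: a topic probability distribution is obtained from the model's raw logits via a softmax, and a close runner-up topic, invisible in the discrete label, survives in the distribution. A minimal, numerically stabilised sketch with made-up logits:

```python
import math

def softmax(logits):
    """Turn raw model logits into a topic probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# The discrete label would report only the first topic, hiding
# the near-tie with the second (logits are made up):
dist = softmax([2.0, 1.9, -1.0])
print([round(p, 2) for p in dist])
```

Here the top two topics end up nearly tied (roughly 0.51 vs 0.46), which is exactly the information a discrete label discards.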

2.3.1.4 ComparisonΒΆ

InΒ [Β ]:
# modified evaluate_model_performance function for side by side output
def compare_model_performance(data_folder, finbert_folder, label_dict,
                               chunking=False, max_length=512, appdx="",
                               save_ground_truth=False):
    bank = 'JPMorgan'
    appdx_dict = {'_relabelled': 'Raw texts', '_summarised': 'Summarised texts'}

    # load ground truth data
    path_ground_truth = os.path.join(data_folder, f"ground_truth_{bank}.csv")
    df_ground_truth = pd.read_csv(path_ground_truth)

    # load model labels
    path_model = os.path.join(finbert_folder, f"finbert_topics_{bank}_chunking{chunking}_maxlength{max_length}{appdx}.csv")
    df_finbert = pd.read_csv(path_model)

    # merge the datasets on uid (multiple labels per uid possible)
    df_merged = df_finbert.merge(df_ground_truth, on=['uid'], how='left', suffixes=['_model', '_true'])

    # exclude rows with no ground truth
    df_merged = df_merged.dropna(axis=0, subset=['true_topic'])

    if save_ground_truth:
        logit_cols = [col for col in df_merged.columns if 'logit' in col]
        df_merged_short = df_merged.drop(['qa_text_model','qa_text_true',
                                          'true_sentiment', 'true_Q_outcome',
                                          'qa_num', 'finbert_topic_id', 'qa_type'] + logit_cols,
                                         axis=1)
        df_merged_short.to_csv(os.path.join(data_folder, f"finbert_topics_ground_truth_{bank}_chunking{chunking}_maxlength{max_length}_QA{appdx}.csv"),
                         index=False)

    # get the topics covered in labelled and ground truth datasets
    topics = np.unique(np.concatenate((df_merged['true_topic'],df_merged['finbert_topic_label'])))





    # plot confusion matrix - just one for presentation
    fig2, axes2 = plt.subplots(1,1, figsize=(9,8), sharey=True)
    conf_matrix = confusion_matrix(df_merged['true_topic'],
                                    df_merged['finbert_topic_label'],
                                    normalize='true')
    sns.heatmap(conf_matrix, ax=axes2, cmap='crest')
    fsize=19
    axes2.set_xticklabels(topics, rotation=90, fontsize=fsize)
    axes2.set_xlabel("Predicted label", fontsize=fsize)
    axes2.set_title("Confusion matrix", fontsize=fsize)
    axes2.set_ylabel("True label", fontsize=fsize)
    axes2.set_yticklabels(topics, rotation=0, fontsize=fsize)
    cbar = axes2.collections[0].colorbar
    cbar.ax.tick_params(labelsize=fsize)
    cbar.set_label("Fraction of true label", fontsize=fsize)
    axes2.set_title(appdx_dict[appdx], fontsize=40)
    plt.tight_layout()
    fig2.savefig(f"finbert-topic-{bank}-chunking{chunking}-maxlength{max_length}{appdx}.png",
                 dpi=300,bbox_inches="tight")


    # plot confusion matrix
    fig, axes = plt.subplots(1,2, figsize=(8,3), sharey=True)
    for i, (norm, tlt) in enumerate(zip(['pred', 'true'], ['Normalised by predicted', 'Normalised by true'])):
        conf_matrix = confusion_matrix(df_merged['true_topic'],
                                       df_merged['finbert_topic_label'],
                                       normalize=norm)
        sns.heatmap(conf_matrix, ax=axes[i], cmap='crest')
        axes[i].set_xticklabels(topics, rotation=90)
        axes[i].set_xlabel("Predicted label")
        axes[i].set_title(tlt)
    axes[0].set_ylabel("True label")
    axes[0].set_yticklabels(topics, rotation=0)

    plt.show()

    # get classification report
    report = classification_report(
                  df_merged['true_topic'],
                  df_merged['finbert_topic_label'],
                  zero_division=0
                  )
    print(report)
InΒ [Β ]:
out1 = Output()
out2 = Output()

with out1:
  compare_model_performance(
                          data_folder=processed_data_folder,
                          finbert_folder=output_data_folder,
                          label_dict=id2label,
                          chunking=False,
                          max_length=512,
                          appdx='_relabelled'
                          )
with out2:
  compare_model_performance(
                          data_folder=processed_data_folder,
                          finbert_folder=output_data_folder,
                          label_dict=id2label,
                          chunking=False,
                          max_length=512,
                          appdx='_summarised',
                          save_ground_truth=True
                          )
display(HBox([out1, out2]))

2.3.2 BERTopicΒΆ

BERTopic's modularity lets us choose the model used for each of its components. We use PCA for dimensionality reduction, KMeans for clustering, and all-MiniLM-L6-v2 as the embedding model.

For the embedding model, OpenAI embeddings and Google's universal-sentence-encoder were also tried, but the differences in results were small, so we ultimately chose all-MiniLM-L6-v2 for its speed.

InΒ [Β ]:
# load the embedding_model (same for all runs)
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

2.3.2.1 Pre-processing functionΒΆ

InΒ [Β ]:
def preprocess_spacy(text):
    # Load spaCy language model
    nlp = spacy.load("en_core_web_sm")

    # Define a custom list of stop words
    custom_stopwords = set([
        "and", "billions", "due", "jeremy", "text", "net", "qt", "billion", "million", "think", "thank", "year", "years",
        "month", "months", "yeah", "okay", "go", "ok", "hi", "good", "hey", "morning", "sure", "jamie", "jim", "like",
        "thing", "bit", "little", "key", "ceo", "got", "lot", "guy"
    ])

    # Define the target part-of-speech tags to keep and consolidate
    target_pos = {"NOUN"}  # In this example, we choose to consolidate nouns in their lemmatized form

    # Process the text as a spaCy document
    doc = nlp(text)

    # Create an empty set to store consolidated tokens
    consolidated_tokens = set()

    # Perform preprocessing
    for token in doc:
        lemma = token.lemma_.lower()  # Use lemmatized, lowercase form of the word

        # Filtering: remove default stop words, custom stop words, punctuation, whitespace, numbers, and single characters
        if (not token.is_stop                # Default stop words
            and lemma not in custom_stopwords  # Custom stop words
            and not token.is_punct           # Punctuation
            and not token.is_space           # Whitespace
            and not token.like_num           # Numbers
            and len(lemma) > 1):             # Single characters

            # If the part of speech is one of the target types (e.g., nouns), consolidate by adding the lemmatized form
            if token.pos_ in target_pos:
                consolidated_tokens.add(lemma)  # Add the lemmatized form of nouns to the set
            else:
                # Other parts of speech can be added as they are
                consolidated_tokens.add(token.text.lower())

    # Join tokens back to a string
    text = " ".join(consolidated_tokens)
    return text

2.3.2.2 Performing on summarised datasetΒΆ

We have already seen that summarisation greatly improves topic modelling, so we will run BERTopic on the summarised dataset.

InΒ [Β ]:
# phi full table summarised data.xlsx
df_phi_fulltable_summarised = pd.read_excel(processed_data_folder + '/phi_fulltable_summarised.xlsx')

# phi_ground_truth_summarised data.xlsx
df_phi_ground_truth_summarised = pd.read_excel(processed_data_folder + '/phi_ground_truth_summarised.xlsx')
InΒ [Β ]:
# apply preprocessing
df_phi_fulltable_summarised_list = df_phi_fulltable_summarised['summarised_text'].apply(preprocess_spacy).to_list()

# running BERTopic
dim_model = PCA(n_components=5)
cluster_model = KMeans(n_clusters=20)

topic_model_phi_summarised = BERTopic(umap_model=dim_model, embedding_model=embedding_model,
                       hdbscan_model=cluster_model, calculate_probabilities=True)
topics, probabilities = topic_model_phi_summarised.fit_transform(df_phi_fulltable_summarised_list)
InΒ [Β ]:
# Plot each BERTtopic visualization into a subplot
topic_model_phi_summarised.visualize_topics().write_html("topic_model_phi_summarised_topic.html")
topic_model_phi_summarised.visualize_barchart(top_n_topics=20,n_words=8, autoscale=True).write_html("topic_model_phi_summarised_barchart.html")
topic_model_phi_summarised.visualize_heatmap().write_html("topic_model_phi_summarised_heatmap.html")
topic_model_phi_summarised.visualize_hierarchy().write_html("topic_model_phi_summarised_hierarchy.html")

# Load each plot into a subplot

#display(HTML("topic_model_phi_summarised_topic.html"))
display(HTML("topic_model_phi_summarised_barchart.html"))
display(HTML("topic_model_phi_summarised_heatmap.html"))
display(HTML("topic_model_phi_summarised_hierarchy.html"))

plt.tight_layout()
plt.show()

No description has been provided for this image
No description has been provided for this image

2.4 Q/A Evasion and Generalisability of Phi-3.5ΒΆ

2.4.0 Using FinBERT classification for Q/A evasionΒΆ

Using the labels from FinBERT, we can perform a very basic analysis to see if the topics of questions and answers match.

We will correlate the topic probability distributions of all question-answer pairs. This will create a distribution of correlation coefficients.

We will repeat this analysis on data with shuffled Q-A relationships.
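The core of this comparison is just a Pearson correlation between two topic probability vectors. A self-contained sketch with made-up distributions (the actual analysis below uses `scipy.stats.pearsonr` and compares the two resulting coefficient distributions with a Mann-Whitney U test):

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# A question and its true answer share a dominant topic...
q = [0.7, 0.1, 0.1, 0.1]
a = [0.6, 0.2, 0.1, 0.1]
# ...while a randomly re-paired answer does not.
a_shuffled = [0.1, 0.1, 0.7, 0.1]
print(round(pearson(q, a), 2), round(pearson(q, a_shuffled), 2))
```

A high coefficient for a true pair suggests the answer stayed on topic; shuffling the pairings provides the chance-level baseline.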

InΒ [Β ]:
def plot_answers_to_questions(finbert_folder, data_folder,
                                label_dict, chunking=True, max_length=512,
                                bank="JPMorgan", datatype="QA", appdx="", save=False,
                              summarised=False, synthetic=False, exclude=None):

  df_merged, _, metadata = get_merged_data_for_plotting(finbert_folder, data_folder,
                                              label_dict, chunking=chunking, max_length=max_length,
                                              bank=bank, datatype=datatype, appdx=appdx,
                                              summarised=summarised, synthetic=synthetic)

  # subset QAs
  df_merged = df_merged[~df_merged['qa_num'].isna()].reset_index(drop=True).copy()

  # select only Qs that have As
  mask = df_merged.groupby("qa_num")["qa_type"].transform(lambda x: {'A', 'Q'}.issubset(set(x)))
  df_merged = df_merged[mask]

  # exclude General News | Opinion
  if exclude is not None:
      gen_key = [key for key,val in label_dict.items() if val==exclude][0]

      # define probability columns
      prob_cols = [col for col in df_merged.columns if 'prob' in col and f'topic_{gen_key}' not in col]
  else:
      prob_cols = [col for col in df_merged.columns if 'prob' in col]

  # define all columns to keep and subset the df
  subset_cols =prob_cols+ ['uid','qa_num', 'qa_type', 'finbert_topic_label']
  df_qa = df_merged[subset_cols]

  # subset answers
  df_answers = df_qa[df_qa['qa_type']=='A'].reset_index(drop=True).copy()
  df_questions = df_qa[df_qa['qa_type']=='Q'].reset_index(drop=True).copy()
  df_questions['qa_num_shuffled'] = df_questions['qa_num'].sample(frac=1).reset_index(drop=True)

  # add cols to fill
  df_answers['qa_corr'] = np.nan
  df_answers['qa_corr_shuffled'] = np.nan

  # iterate over answers
  for i, row in df_answers.iterrows():
      answer = np.array([row[col] for col in prob_cols])
      question = df_questions[df_questions['qa_num']==row.qa_num][prob_cols].values.flatten()
      question_shuffled = df_questions[df_questions['qa_num_shuffled']==row.qa_num][prob_cols].values.flatten()
      df_answers.at[i,'qa_corr'] = pearsonr(answer, question)[0]
      df_answers.at[i,'qa_corr_shuffled'] = pearsonr(answer, question_shuffled)[0]

  # stats
  _, pval = mannwhitneyu(df_answers['qa_corr'], df_answers['qa_corr_shuffled'], alternative='two-sided')
  print(f"p-val: {pval}")

  # plot
  fig, ax = plt.subplots(1, 1, figsize=(4,2))
  sns.kdeplot(df_answers['qa_corr'], ax=ax, color='#FE7F2D', alpha=0.6, label='Data', lw=2)
  sns.kdeplot(df_answers['qa_corr_shuffled'], ax=ax, color='#233D4D', alpha=0.6, label='Shuffled', lw=2, ls='dashed')

  ax.set_xlabel("Pearson correlation coefficient")
  plt.legend(loc='upper left', bbox_to_anchor=(1.02, 1))

  # save correlation data
  if save:
      df_for_saving = df_answers[['uid', 'qa_num', 'qa_corr']]
      if synthetic:
          df_for_saving['uid'] = ["_".join(x.split("_")[:-1]) for x in df_for_saving['uid']]
      path_for_saving = os.path.join(finbert_folder, f"finbert_QA_topic_correlations{appdx}.csv")
      df_for_saving.to_csv(path_for_saving, index=False)
InΒ [Β ]:
# Checking question evasiveness on raw text
plot_answers_to_questions(finbert_folder=output_data_folder, data_folder=processed_data_folder,
                                label_dict=id2label, chunking=False, max_length=512,
                                bank="JPMorgan", datatype="QA", appdx="",
                          exclude="General News | Opinion", save=True)
p-val: 1.2490241106466923e-19
No description has been provided for this image
InΒ [Β ]:
# Now - on text summarised by Phi-3.5
plot_answers_to_questions(finbert_folder=output_data_folder, data_folder=processed_data_folder,
                                label_dict=id2label, chunking=False, max_length=512,
                                bank="JPMorgan", datatype="QA", appdx="_summarised",
                                summarised=True, save=True)
p-val: 3.095980672406378e-13
No description has been provided for this image

With both datasets, topic alignment is significantly higher for true Q-A pairs than for shuffled pairs. Interestingly, text summarisation has resulted in equal-height peaks near Q-A topic correlation coefficients of 0 and 1, suggesting similar numbers of answered and avoided questions. The analysis of raw text was somewhat more flattering to the executives.

2.4.1 Phi-3.5 Pipeline initialisation and Helper functionsΒΆ

InΒ [Β ]:
# Initialise the pipeline - note T4 GPU does not contain enough RAM, so use CPU and ignore warning OR run on A100 GPU
pipe = pipeline("text-generation", model="microsoft/Phi-3.5-mini-instruct", trust_remote_code=True, device=0)

2.4.1.0 Preparing datasets for Phi 3.5 (with results from other analyses)ΒΆ

InΒ [Β ]:
"""
Aggregation function to merge Questions with their Answers
    - all answers to a question will be merged into one
    - each qa_text row will now start with the questions asked and then continue into all of the answers given
    - this allows for assessment of question evasion in the answers
    - assumes RoBERTa Sentiment and finBERT(qa_corr) columns already included
"""
# Function to aggregate Q and A pairs into single rows for question-evasion testing (test data only)
def aggregate_test_data(df):

  # Select columns to keep for test data
  df = df[['uid', 'qa_type', 'qa_num', 'qa_text', 'true_sentiment', 'true_topic', 'true_Q_outcome', 'RoBERTa Sentiment', 'qa_corr', 'qa_corr_summarised', 'finBERT Topic Classification', 'finBERT(summarised) Topic Classification']]

  # Convert 'qa_num' to numeric, forcing non-numeric values to NaN, then drop rows where 'qa_num' is NaN
  # Note this will exclude all 'N' type rows, which is required for QA pair analysis
  df['qa_num'] = pd.to_numeric(df['qa_num'], errors='coerce')
  df = df.dropna(subset=['qa_num'])

  # Convert qa_num, qa_corr to integers/floats and qa_text to strings
  df.loc[:, 'qa_num'] = df['qa_num'].astype(int)
  df.loc[:, 'qa_corr'] = df['qa_corr'].astype(float)
  df.loc[:, 'qa_corr_summarised'] = df['qa_corr_summarised'].astype(float)
  df.loc[:, 'qa_text'] = df['qa_text'].astype(str)


  # Group by 'qa_num', with different approach for each column
  def custom_aggregation(group):

    # Aggregate qa_text by joining texts for each question
    aggregated_text = ' '.join(group['qa_text'])

    # For true_Q_outcome, take the first value
    true_Q_outcome = group['true_Q_outcome'].iloc[0] if 'true_Q_outcome' in group else None

    # For qa_corrs, take the mean
    qa_corr_value = group['qa_corr'].mean() if 'qa_corr' in group else None
    qa_corr_value_summarised = group['qa_corr_summarised'].mean() if 'qa_corr_summarised' in group else None

    # For true_topic, true_sentiment, finBERT and RoBERTa Sentiment, only take values where qa_type is 'A'
    true_topic = group.loc[group['qa_type'] == 'A', 'true_topic'].unique().tolist() if 'true_topic' in group else []
    true_sentiment = group.loc[group['qa_type'] == 'A', 'true_sentiment'].unique().tolist() if 'true_sentiment' in group else []
    roberta_sentiment = group.loc[group['qa_type'] == 'A', 'RoBERTa Sentiment'].unique().tolist() if 'RoBERTa Sentiment' in group else []
    finBERT_topics_1 = group.loc[group['qa_type'] == 'A', 'finBERT Topic Classification'].unique().tolist() if 'finBERT Topic Classification' in group else []
    finBERT_topics_summarised = group.loc[group['qa_type'] == 'A', 'finBERT(summarised) Topic Classification'].unique().tolist() if 'finBERT(summarised) Topic Classification' in group else []

    # Print uids for manual review where multiple sentiments/topics in answer parts
    check_list = [true_sentiment, roberta_sentiment, true_topic, finBERT_topics_1, finBERT_topics_summarised]
    check_names = ['true_sentiment', 'RoBERTa Sentiment', 'true_topic', 'finBERT Topic Classification', 'finBERT(summarised) Topic Classification']
    manual_review_list = []

    for i, check in enumerate(check_list):
      if len(check) > 1:
        manual_review_list.append(group['uid'].iloc[0])

    # Remove duplicates and filter out empty strings from the list
    clean_manual_review_list = list(set([uid for uid in manual_review_list if uid]))
    print(clean_manual_review_list)

    # Convert lists to strings (removing brackets) for values expected to be singular values
    true_topic = ', '.join(true_topic)
    true_sentiment = ', '.join(true_sentiment)
    roberta_sentiment = ', '.join(roberta_sentiment)
    finBERT_topics_1 = ', '.join(finBERT_topics_1)
    finBERT_topics_summarised = ', '.join(finBERT_topics_summarised)


    # Return a dictionary of aggregated values
    return pd.Series({
        'uid': group['uid'].iloc[0].replace('_Q_', '_QA_'),
        'qa_type': 'QA',
        'qa_num': group['qa_num'].iloc[0],
        'qa_text': aggregated_text,
        'true_sentiment': true_sentiment,
        'true_topic': true_topic,
        'true_Q_outcome': true_Q_outcome,
        'RoBERTa Sentiment': roberta_sentiment,
        'qa_corr': qa_corr_value,
        'finBERT Topic Classification': finBERT_topics_1,
        'qa_corr_summarised': qa_corr_value_summarised,
        'finBERT(summarised) Topic Classification': finBERT_topics_summarised
    })


  # Apply custom aggregation
  aggregated_df = df.groupby('qa_num', as_index=False).apply(custom_aggregation).reset_index(drop=True)

  # Rename columns and set index to 'uid' for both dfs
  for table in [aggregated_df, df]:
    for col in table.columns:
      if col == 'true_sentiment': table.rename(columns={'true_sentiment': 'True Sentiment'}, inplace=True)
      if col == 'true_topic': table.rename(columns={'true_topic': 'True Topic'}, inplace=True)
      if col == 'true_Q_outcome': table.rename(columns={'true_Q_outcome': 'True Evasion Present'}, inplace=True)

  # Visualise DataFrames
  print("\nOriginal DataFrame:", len(df), "rows")
  display(df.head())

  print("\nAggregated QA DataFrame:", len(aggregated_df), "rows")
  display(aggregated_df.head())

  return df, aggregated_df
InΒ [Β ]:
"""
Aggregation function to merge Questions with their Answers
    - all answers to a question will be merged into one
    - each qa_text row will now start with the questions asked and then continue into all of the answers given
    - this allows for assessment of question evasion in the answers
"""
# Function to aggregate Q and A pairs into single rows for question-evasion testing (Non-test data only)
def aggregate_QA_data(df):

  # Convert 'qa_num' to numeric, forcing non-numeric values to NaN, then drop rows where 'qa_num' is NaN
  # Note this will exclude all 'N' type rows, which is required for QA pair analysis
  df.loc[:, 'qa_num'] = pd.to_numeric(df['qa_num'], errors='coerce')
  df = df.dropna(subset=['qa_num'])

  # Convert qa_num to integers and qa_text to strings
  df.loc[:, 'qa_num'] = df['qa_num'].astype(int)
  df.loc[:, 'qa_text'] = df['qa_text'].astype(str)

  # Check
  print("\nOriginal DataFrame:", len(df), "rows")
  display(df.head())


  # Only keep essential columns: 'uid', 'qa_type', 'qa_num', and 'qa_text' (common to all pre-processed files) for real data
  df = df[['uid', 'qa_type', 'qa_num', 'qa_text']]

  # Group by 'qa_num' and aggregate the 'qa_text' column by joining texts for each question
  df.loc[:, 'qa_text'] = df.groupby('qa_num')['qa_text'].transform(lambda x: ' '.join(x))

  # Drop duplicates to keep only one row per 'qa_num' with the aggregated text
  df = df.drop_duplicates(subset=['qa_num'])

  # Convert all qa_type entries to QA and 'uid' to QA
  df.loc[:, 'qa_type'] = 'QA'
  df['uid'] = df['uid'].str.replace('_Q_', '_QA_')

  # Set index to uid and drop index column
  df = df.set_index('uid', drop=True)

  # Check
  print("\nAggregated QA DataFrame:", len(df), "rows")
  display(df.head())

  return df
InΒ [Β ]:
# Function to calculate finBERT Evasion result from qa_corr or qa_corr_summarised value (if higher than 0.9, become Not Evasive, else Evasive)
def calculate_finBERT_evasion(df):

    # Convert qa_corr to float type
  if 'qa_corr' in df.columns:
    df['qa_corr'] = df['qa_corr'].astype(float)
    df['qa_corr'] = np.where(df['qa_corr'] > 0.9, 'Not Evasive', 'Evasive')
    df.rename(columns={'qa_corr': 'finBERT Evasion Present'}, inplace=True)

  if 'qa_corr_summarised' in df.columns:
    df['qa_corr_summarised'] = df['qa_corr_summarised'].astype(float)
    df['qa_corr_summarised'] = np.where(df['qa_corr_summarised'] > 0.9, 'Not Evasive', 'Evasive')
    df.rename(columns={'qa_corr_summarised': 'finBERT(summarised) Evasion Present'}, inplace=True)

  # Set index to uid and drop index column
  df = df.set_index('uid', drop=True)

  return df
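The thresholding rule above reduces to a one-liner; a minimal illustration of the 0.9 cut-off (the threshold itself is a design choice informed by the correlation distributions seen earlier):

```python
def evasion_label(qa_corr, threshold=0.9):
    """Map a Q-A topic correlation to a coarse evasion flag."""
    return "Not Evasive" if qa_corr > threshold else "Evasive"

print(evasion_label(0.95))  # Not Evasive
print(evasion_label(0.40))  # Evasive
```

Note the comparison is strict, so a correlation of exactly 0.9 is still flagged as "Evasive".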

2.4.1.1 PromptsΒΆ

InΒ [Β ]:
# Define prompt questions for Evasion, Sentiment and Topic analysis with Phi-3.5 and Q&A tables of pre-processed financial transcripts

# All questions (for evasion to run, must be aggregated dataset, otherwise it will ignore evasion question)
combined_questions = [
    f"1. Analyze the following financial question-answer pairs. Determine if the answer is evasive or non-evasive. Consider the following factors:\n"
    "1. Deception: Is the answer ambiguous, misleading or trying to conceal something? 2. Relevance: Is the information provided relevant to the question? 3. Specificity: Is the answer specific enough to provide an informative response in a financial context?\n\n"
    "Example of an Evasive answer: What is the company's strategy for addressing the rising costs of raw materials? We are taking positive steps to mitigate the impact of rising costs. Explanation: This is evasive because the answer does not give any detail about the strategy. \n\n"
    "Example of a non-evasive answer: How will the recent interest rate hike impact the bank's net interest margin? The recent interest rate hike is expected to positively impact the bank's net interest margin, as the bank's interest-earning assets tend to reprice faster than its interest-bearing liabilities. Explanation: This is a non-evasive answer as it clearly states the impact.\n\n"
    "If evasion is detected, classify it as Evasive, with a level of 'Low', 'Moderate', or 'High'. For each instance of evasion, identify up to 5 specific topics that were not adequately addressed.\n"
    "Each theme should be a short keyword or phrase summarizing the topic, e.g., '2024 consensus,' 'NII Markets,' 'Asset-Sensitive Number.'\n"
    "Your response should use the format:\n\n"
    "Evasive: [Low/Moderate/High]; [topic1, topic2, topic3]\n\n",

    f"2. Categorise the sentiment in this text as positive, negative, or neutral. Provide relative percentage scores and up to 5 concise keywords or themes contributing to each sentiment, suitable for direct input into a table. Use the following format:\n\n"
    "Positive: %; [theme1, theme2, theme3]\n"
    "Negative: %; [theme1, theme2, theme3]\n"
    "Neutral: %; [theme1, theme2, theme3]\n"
    "Each theme should be a short keyword or phrase summarizing the topic, e.g., 'NII,' '2024 consensus,' or 'Asset-Sensitive Number.'\n"

    f"3. Identify the primary topic discussed in the following financial transcript and classify it as one of these categories: "
    "Earnings, Financials, M&A | Investments, General News | Opinion, Fed | Central Banks, Company | Product News, Markets, Treasuries | Corporate Debt, Legal | Regulation, Macro, Energy | Oil, Currencies, Analyst Update, IPO, Dividend, Politics, Gold | Metals | Materials, Stock Movement, Personnel Change, Stock Commentary.\n"
    "Do not make up other categories, assign it to the best fit from the categories provided. Your final response must use the format:\n\n"
    "finBERT Topic: [category]\n\n"
    ]

# Sentiment only
sentiment_questions = [
    f"1. Categorise the financial sentiment in this text as positive, negative, or neutral. Provide relative percentage scores and up to 5 concise keywords or themes contributing to each sentiment, suitable for direct input into a table. Use the following format:\n\n"
    "Positive: %; [theme1, theme2, theme3]\n"
    "Negative: %; [theme1, theme2, theme3]\n"
    "Neutral: %; [theme1, theme2, theme3]\n"
    "Each theme should be a short keyword or phrase summarizing the topic, e.g., 'NII,' '2024 consensus,' or 'Asset-Sensitive Number.'\n",
]

# Evasion only (needs aggregated QA dataset)
evasion_questions = [
    f"Analyze the following financial question-answer pairs. Determine if the answer is evasive or non-evasive. Consider the following factors:\n"
    "1. Deception: Is the answer ambiguous, misleading or trying to conceal something? 2. Relevance: Is the information provided relevant to the question? 3. Specificity: Is the answer specific enough to provide an informative response in a financial context?\n\n"
    "Example of an Evasive answer: What is the company's strategy for addressing the rising costs of raw materials? We are taking positive steps to mitigate the impact of rising costs. Explanation: This is evasive because the answer does not give any detail about the strategy. \n\n"
    "Example of a non-evasive answer: How will the recent interest rate hike impact the bank's net interest margin? The recent interest rate hike is expected to positively impact the bank's net interest margin, as the bank's interest-earning assets tend to reprice faster than its interest-bearing liabilities. Explanation: This is a non-evasive answer as it clearly states the impact.\n\n"
    "If evasion is detected, classify it as Evasive, with a level of 'Low', 'Moderate', or 'High'. For each instance of evasion, identify up to 5 specific topics that were not adequately addressed.\n"
    "Each theme should be a short keyword or phrase summarizing the topic, e.g., '2024 consensus,' 'NII Markets,' 'Asset-Sensitive Number.'\n"
    "Your response should use the format:\n\n"
    "Evasive: [Low/Moderate/High]; [topic1, topic2, topic3]\n\n"
]

# Topic assignment to match finBERT topics
finBERT_topic_questions = [
    f"Identify the primary topic discussed in the following financial transcript and classify it as one of these categories: "
    "Earnings, Financials, M&A | Investments, General News | Opinion, Fed | Central Banks, Company | Product News, Markets, Treasuries | Corporate Debt, Legal | Regulation, Macro, Energy | Oil, Currencies, Analyst Update, IPO, Dividend, Politics, Gold | Metals | Materials, Stock Movement, Personnel Change, Stock Commentary.\n"
    "Do not make up other categories, assign it to the best fit from the categories provided. Your final response must use the format:\n\n"
    "finBERT Topic: [category]\n\n"
]

2.4.1.2 Functions to run analyses with Phi-3.5¶

In [ ]:
"""
Financial keywords to search for in text are defined here
"""
keyword_search_terms = ["SS3/24", "SS3", "SS three", "supervisory statement", "PRA", "credit risk definition of default", "definition of default", "Bank of England", "BoE", "England", "Basel", "Basel III", "Basel IV", "S9/24", "S9", "S nine"]


"""
Function to search provided text for financial keywords
  - this can be done by whole pdf (as text file) or row by row of Q&A table
"""
# Search for keywords of interest
def keyword_search(text, keywords):
    found_terms = [term for term in keywords if term.lower() in text.lower()]
    if found_terms:
        term_list = ', '.join(found_terms)
        return term_list
    else:
        return "None"
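For illustration, here is a standalone sketch of `keyword_search` on made-up text (the function is reproduced so the snippet runs on its own). Note the match is a plain case-insensitive substring test, so a short term such as 'PRA' would also match inside longer words like 'practice'.

```python
# keyword_search reproduced from above; returns a comma-separated list of hits or "None"
def keyword_search(text, keywords):
    found_terms = [term for term in keywords if term.lower() in text.lower()]
    return ', '.join(found_terms) if found_terms else "None"

# Illustrative (made-up) inputs
terms = ["PRA", "Basel III", "definition of default"]
print(keyword_search("The PRA consulted on Basel III implementation.", terms))  # -> PRA, Basel III
print(keyword_search("No supervisory terms here.", terms))                      # -> None
```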
In [ ]:
"""
Function to run phi-3.5 on tabular questions and answers.
  This will take as input either the aggregated Q&A pairs (for combined or separate question-evasion and sentiment/topic analysis) or separate Q, A and N rows table (for sentiment and topic analysis only)
  The prompt used can vary but must produce the following output format (where applicable for Sentiment or Evasion analysis):
   "Positive: %; [theme1, theme2, theme3]
    Negative: %; [theme1, theme2, theme3]
    Neutral: %; [theme1, theme2, theme3]
    Each theme should be a short keyword or phrase summarizing the topic, e.g., 'NII,' '2024 consensus,' or 'Asset-Sensitive Number.'
    Evasive: [Low/Moderate/High]; [topic1, topic2, topic3]
    Each theme should be a short keyword or phrase summarizing the topic, e.g., '2024 consensus,' 'NII Markets,' 'Asset-Sensitive Number.''
   "

  The non-aggregated 'separate Q, A and N rows table' input option is intended only for ground-truth assessment of performance against other methods such as BERTopic, finBERT and RoBERTa.
  Analysis path is determined by presence of 'sentiment' and/or 'Evasive' in the prompt
  There is a sense check against non-aggregated data being accidentally input with the evasive prompt (as this only works with aggregated data).
"""

# Function to run phi-3.5 on tabular questions and answers
def phi_question_answer(input_df, prompt_questions, input_aggregated='N', input_col='qa_text'):

  print("input_aggregated:", input_aggregated)

  # Set up progress counter and timer
  start_time = time.time()
  total_count = len(input_df)
  x = 0

  # Determine table columns based on the questions asked
  columns = ["uid", "keywords"]
  if any("Evasive" in q for q in prompt_questions):
      columns.extend(["Phi-3.5 Evasion Present", "Phi-3.5 Evasion Degree", "Phi-3.5 Evaded Topics"])

      # Check that aggregated flag is on for evasion input data - if not exit function with warning
      if input_aggregated == 'N':
        print("Error: Evasion question present in prompt but non-aggregated dataset provided. Analysis stopped.")
        return None

  if any("sentiment" in q for q in prompt_questions):
      columns.extend(["Phi-3.5 Sentiment", "Phi-3.5 Positive %", "Phi-3.5 Negative %", "Phi-3.5 Neutral %", "Phi-3.5 Positive Topics", "Phi-3.5 Negative Topics", "Phi-3.5 Neutral Topics"])

  if any("finBERT" in q for q in prompt_questions):
      columns.extend(["Phi-3.5 Topic Classification"])


  # Initialise PrettyTable with dynamic columns
  table = PrettyTable(columns)

  # Loop through rows in phi_input_df and apply keyword search and Phi3.5 questions to each row
  for i, row in input_df.iterrows():

      # Iterate counter
      x += 1

      # Check for keywords
      text = row[input_col]
      keywords = keyword_search(text, keyword_search_terms)
      if keywords != 'None':
        print(f"Keywords found: {keywords}")
      if x == 1 or x % 5 == 0:
        print(f"{x}/{total_count} ...")

      # Initialise variables to store extracted data (depending on questions asked)
      row_data = {"uid": i, "keywords": keywords}
      pos_value, neg_value, neut_value = 0, 0, 0

      if "Phi-3.5 Positive %" in columns:
          row_data.update({"Phi-3.5 Sentiment": "N/A", "Phi-3.5 Positive %": "N/A", "Phi-3.5 Negative %": "N/A", "Phi-3.5 Neutral %": "N/A", "Phi-3.5 Positive Topics": "N/A", "Phi-3.5 Negative Topics": "N/A", "Phi-3.5 Neutral Topics": "N/A"})
      if "Phi-3.5 Evasion Present" in columns:
          row_data.update({"Phi-3.5 Evasion Present": "N/A", "Phi-3.5 Evasion Degree": "N/A", "Phi-3.5 Evaded Topics": "N/A"})
      if "Phi-3.5 Topic Classification" in columns:
          row_data.update({"Phi-3.5 Topic Classification": "N/A"})


      # Answer prompt questions with Phi-3.5 on text from each row
      for question in prompt_questions:
          prompt = f"Context: {text}\n\nQuestion: {question}\nAnswer:"
          output = pipe(prompt, max_new_tokens=130, min_length=20, do_sample=False, eos_token_id=pipe.tokenizer.eos_token_id)
          answer = output[0]["generated_text"].replace(prompt, "").strip()
          #print(answer)

          # Extract Topic Classification information from the answer, if finBERT classification question asked
          if "finBERT" in question:
            if 'finBERT Topic:' in answer:
              row_data["Phi-3.5 Topic Classification"] = answer.split('Topic: ')[1].split('\n')[0].strip() if 'Topic: ' in answer else "N/A"

          # Extract sentiment-related information from the answer, if sentiment question asked
          if "sentiment" in question:
              if "Positive:" in answer:
                pos_part = answer.split("Positive: ")[1]
                pos_value = float(pos_part.split(";")[0].strip().replace("%", "")) if pos_part else 0
                row_data["Phi-3.5 Positive %"] = pos_value
                row_data["Phi-3.5 Positive Topics"] = pos_part.split("[")[1].split("]")[0].strip() if "[" in pos_part and "]" in pos_part else "N/A"

              if "Negative:" in answer:
                neg_part = answer.split("Negative: ")[1]
                neg_value = float(neg_part.split(";")[0].strip().replace("%", "")) if neg_part else 0
                row_data["Phi-3.5 Negative %"] = neg_value
                row_data["Phi-3.5 Negative Topics"] = neg_part.split("[")[1].split("]")[0].strip() if "[" in neg_part and "]" in neg_part else "N/A"

              if "Neutral:" in answer:
                neutral_part = answer.split("Neutral: ")[1]
                neut_value = float(neutral_part.split(";")[0].strip().replace("%", "")) if neutral_part else 0
                row_data["Phi-3.5 Neutral %"] = neut_value
                row_data["Phi-3.5 Neutral Topics"] = neutral_part.split("[")[1].split("]")[0].strip() if "[" in neutral_part and "]" in neutral_part else "N/A"

              # Determine the overall sentiment based on the highest percentage value, checking for "Neutral" first, with logic conflicts also resulting in Neutral
              if neut_value >= pos_value and neut_value >= neg_value:
                row_data["Phi-3.5 Sentiment"] = "Neutral"
              elif neg_value >= pos_value and neg_value >= neut_value:
                row_data["Phi-3.5 Sentiment"] = "Negative"
              elif pos_value > neg_value and pos_value > neut_value:
                row_data["Phi-3.5 Sentiment"] = "Positive"
              else:
                row_data["Phi-3.5 Sentiment"] = "Neutral"

          # Extract question evasion determination from the answer, if evasion question asked
          if "Evasive" in question:
            if "Evasive" in answer:
              row_data["Phi-3.5 Evasion Present"] = "Evasive"

              # If Evasion degree is low, set to Non-Evasive
              if "Low" in answer:
                row_data["Phi-3.5 Evasion Degree"] = "Low"
                row_data["Phi-3.5 Evasion Present"] = "Not Evasive"

              elif "Moderate" in answer: row_data["Phi-3.5 Evasion Degree"] = "Moderate"
              elif "High" in answer: row_data["Phi-3.5 Evasion Degree"] = "High"
            else:
              row_data["Phi-3.5 Evasion Present"] = "Not Evasive"
              row_data["Phi-3.5 Evasion Degree"] = "None"
            row_data["Phi-3.5 Evaded Topics"] = answer.split('[')[1].split(']')[0].strip() if '[' in answer and ']' in answer else "N/A"

      # Add the data to the PrettyTable
      table.add_row([row_data[col] for col in columns])


  # Convert PrettyTable to a DataFrame and merge with original table
  phi_temp_df = pd.DataFrame(table.rows, columns=table.field_names)
  phi_final_df = pd.merge(input_df, phi_temp_df, left_index=True, right_on="uid", how="outer")
  phi_final_df.set_index('uid', inplace=True)
  phi_final_df = phi_final_df.sort_values(by='qa_num')

  # View df
  display(phi_final_df.head())

  # Calculate time for full dataset
  end_time = time.time()
  time_taken = end_time - start_time

  print(f"Time taken for {len(phi_final_df)} QA: {round(time_taken/60, 2)} minutes")

  if input_aggregated == 'Y':
    # Extrapolate per-row time to the 375 aggregated Q&A pairs expected across all transcripts
    print(f"Estimate for all transcripts: {round((time_taken/len(phi_final_df) * 375)/60/60, 2)} hours")
  else:
    # Extrapolate per-row time to the 927 separate Q, A and N rows expected across all transcripts
    print(f"Estimate for all transcripts: {round((time_taken/len(phi_final_df) * 927)/60/60, 2)} hours")

  return phi_final_df
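The sentiment-extraction logic inside the loop above can be sketched as a standalone helper, assuming the model honours the prompted "Positive: %; [theme1, theme2, theme3]" format (the answer string below is made up for illustration):

```python
# Parse one sentiment line of the form "<Label>: <value>%; [topic1, topic2]"
def parse_sentiment_line(answer, label):
    marker = f"{label}: "
    if marker not in answer:
        return 0.0, "N/A"
    part = answer.split(marker)[1]
    value = float(part.split(";")[0].strip().replace("%", ""))
    topics = part.split("[")[1].split("]")[0].strip() if "[" in part and "]" in part else "N/A"
    return value, topics

answer = "Positive: 70%; [NII, loan growth]\nNegative: 0%; []\nNeutral: 30%; [guidance]"
scores = {lab: parse_sentiment_line(answer, lab)[0] for lab in ("Neutral", "Negative", "Positive")}
# Overall label is the highest score; listing Neutral first reproduces the tie-breaking above
overall = max(scores, key=scores.get)
print(overall)  # -> Positive
```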

2.4.1.3 Plot functions¶

In [ ]:
"""
Functions to calculate accuracy metrics for sentiment and evasion comparisons, and to display bar charts and confusion matrices for the Phi results
"""

# Function to compare all columns with 'Sentiment' or 'Evasion Present' in the column name against 'True Sentiment' or 'True Evasion Present' column for each row
def calculate_accuracy(df, sentiment_flag='Y', evasion_flag='Y', topic_flag='Y'):

  accuracy_results = {
        'Sentiment Accuracy': {},
        'Evasion Accuracy': {},
        'Topic Accuracy': {}
  }

  threshold_results = {
      'Sentiment Threshold': 0.33,
      'Evasion Threshold': 0.5,
      'Topic Threshold': 0.2
  }

  num_sentiments = 3
  num_evasion = 2
  num_topics = 20

  # Find all columns that contain 'Sentiment' and compare them against 'True Sentiment'
  sentiment_columns = [col for col in df.columns if 'sentiment' in col.lower() and col != 'True Sentiment']

  if sentiment_columns:
    for col in sentiment_columns:
      sentiment_accuracy = ((df[col].astype(str).str.lower() == df['True Sentiment'].astype(str).str.lower()).sum() / len(df)) * 100
      accuracy_results['Sentiment Accuracy'][col] = round(sentiment_accuracy, 2)
    threshold_results['Sentiment Threshold'] = round((100 / num_sentiments), 2)
  else:
    print("No sentiment columns found.")


  # Find all columns that contain 'Evasion Present' and compare them against 'True Evasion Present'
  evasion_columns = [col for col in df.columns if 'evasion present' in col.lower() and col != 'True Evasion Present']

  if evasion_columns:
    for col in evasion_columns:
      evasion_accuracy = ((df[col].astype(str).str.lower() == df['True Evasion Present'].astype(str).str.lower()).sum() / len(df)) * 100
      accuracy_results['Evasion Accuracy'][col] = round(evasion_accuracy, 2)
    threshold_results['Evasion Threshold'] = round((100 / num_evasion), 2)
  else:
    print("No evasion columns found.")

  # Find all columns that contain 'Classification' and compare them against 'True Topic'
  topic_columns = [col for col in df.columns if 'classification' in col.lower() and col != 'True Topic']

  if topic_columns:
    for col in topic_columns:
      topic_accuracy = ((df[col].astype(str).str.lower() == df['True Topic'].astype(str).str.lower()).sum() / len(df)) * 100
      accuracy_results['Topic Accuracy'][col] = round(topic_accuracy, 2)
    threshold_results['Topic Threshold'] = round((100 / num_topics), 2)
  else:
    print("No topic classification columns found.")


  # Print the accuracy results
  if accuracy_results['Topic Accuracy'] and topic_flag == 'Y':
    print(f"\nTopic Accuracy Results (Significance Threshold {threshold_results['Topic Threshold']}%):")
    for method, acc in accuracy_results['Topic Accuracy'].items():
      print(f"{method}: {acc}%")

  if accuracy_results['Sentiment Accuracy'] and sentiment_flag == 'Y':
    print(f"\nSentiment Accuracy Results (Significance Threshold {threshold_results['Sentiment Threshold']}%):")
    for method, acc in accuracy_results['Sentiment Accuracy'].items():
      print(f"{method}: {acc}%")

  if accuracy_results['Evasion Accuracy'] and evasion_flag == 'Y':
    print(f"\nEvasion Accuracy Results (Significance Threshold {threshold_results['Evasion Threshold']}%):")
    for method, acc in accuracy_results['Evasion Accuracy'].items():
      print(f"{method}: {acc}%")

  plot_accuracy_charts(accuracy_results, df, threshold_results, sentiment_flag, evasion_flag, topic_flag)

  return accuracy_results
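The accuracy used throughout `calculate_accuracy` is a case-insensitive exact match between a prediction column and its 'True ...' counterpart; here is a minimal plain-Python sketch on illustrative labels (the chance thresholds above are simply 100 divided by the number of classes):

```python
# Case-insensitive exact-match accuracy, as a percentage (illustrative data below)
def match_accuracy(predicted, true):
    hits = sum(str(p).lower() == str(t).lower() for p, t in zip(predicted, true))
    return round(100 * hits / len(true), 2)

true_sent = ["Positive", "Negative", "Neutral", "Positive"]
phi_sent = ["positive", "Neutral", "Neutral", "Positive"]
print(match_accuracy(phi_sent, true_sent))  # -> 75.0
# For three sentiment classes the chance threshold is 100 / 3 ~= 33.33%
```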



# Function to plot separate accuracy bar charts for sentiment and evasion comparisons
def plot_accuracy_charts(accuracy_results, df, threshold_results, sentiment_flag='Y', evasion_flag='Y', topic_flag='Y'):
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    plot_idx = 0
    threshold_line_colour = 'darkgrey'

     # Plot Topic Accuracy
    a2 = 0
    if accuracy_results['Topic Accuracy'] and topic_flag == 'Y':
      plot_name = 'Topic Accuracy'
      threshold = threshold_results.get('Topic Threshold', 0)
      labels = [label.split()[0] for label in accuracy_results[plot_name].keys()]
      axes[0, a2].bar(accuracy_results[plot_name].keys(), accuracy_results[plot_name].values(), color='darkgreen')
      axes[0, a2].set_title(plot_name, fontsize=16)
      axes[0, a2].set_ylabel('Accuracy %', fontsize=14)
      axes[0, a2].set_ylim(0, 100)
      axes[0, a2].set_xticks(range(len(accuracy_results[plot_name])))
      axes[0, a2].set_xticklabels(labels, rotation=45, ha='right')
      axes[0, a2].tick_params(axis='y', labelsize=14)
      axes[0, a2].tick_params(axis='x', labelsize=14)

      # Add a dashed horizontal line at the chance threshold
      axes[0, a2].axhline(y=threshold, color=threshold_line_colour, linestyle='--')

      # Annotate each bar with the accuracy value
      for i, (label, value) in enumerate(accuracy_results[plot_name].items()):
        axes[0, a2].text(i, value -1, f'{value:.1f}%', ha='center', va='top', fontsize=12, color='white')

    else:
      axes[0, a2].axis('off')


    # Plot Sentiment Accuracy
    a2 = 1
    if accuracy_results['Sentiment Accuracy'] and sentiment_flag == 'Y':
      plot_name = 'Sentiment Accuracy'
      threshold = threshold_results.get('Sentiment Threshold', 0)
      labels = [label.split()[0] for label in accuracy_results[plot_name].keys()]
      axes[0, a2].bar(accuracy_results[plot_name].keys(), accuracy_results[plot_name].values(), color='royalblue')
      axes[0, a2].set_title(plot_name, fontsize=16)
      axes[0, a2].set_ylabel('Accuracy %', fontsize=14)
      axes[0, a2].set_ylim(0, 100)
      axes[0, a2].set_xticks(range(len(accuracy_results[plot_name])))
      axes[0, a2].set_xticklabels(labels, rotation=45, ha='right')
      axes[0, a2].tick_params(axis='y', labelsize=14)
      axes[0, a2].tick_params(axis='x', labelsize=14)

      # Add a dashed horizontal line at the chance threshold
      axes[0, a2].axhline(y=threshold, color=threshold_line_colour, linestyle='--')

      # Annotate each bar with the accuracy value
      for i, (label, value) in enumerate(accuracy_results[plot_name].items()):
        axes[0, a2].text(i, value -1, f'{value:.1f}%', ha='center', va='top', fontsize=12, color='white')

    else:
      axes[0, a2].axis('off')


    # Plot Evasion Accuracy
    a2 = 2
    if accuracy_results['Evasion Accuracy'] and evasion_flag == 'Y':
      plot_name = 'Evasion Accuracy'
      threshold = threshold_results.get('Evasion Threshold', 0)
      labels = [label.split()[0] for label in accuracy_results[plot_name].keys()]
      axes[0, a2].bar(accuracy_results[plot_name].keys(), accuracy_results[plot_name].values(), color='brown')
      axes[0, a2].set_title(plot_name, fontsize=16)
      axes[0, a2].set_ylabel('Accuracy %', fontsize=14)
      axes[0, a2].set_ylim(0, 100)
      axes[0, a2].set_xticks(range(len(accuracy_results[plot_name])))
      axes[0, a2].set_xticklabels(labels, rotation=45, ha='right')
      axes[0, a2].tick_params(axis='y', labelsize=14)
      axes[0, a2].tick_params(axis='x', labelsize=14)

      # Add a dashed horizontal line at the chance threshold
      axes[0, a2].axhline(y=threshold, color=threshold_line_colour, linestyle='--')

      # Annotate each bar with the accuracy value
      for i, (label, value) in enumerate(accuracy_results[plot_name].items()):
        axes[0, a2].text(i, value -1, f'{value:.1f}%', ha='center', va='top', fontsize=12, color='white')

    else:
      axes[0, a2].axis('off')


    # Find the model with the highest accuracy for each category
    best_sentiment_model = max(accuracy_results['Sentiment Accuracy'], key=accuracy_results['Sentiment Accuracy'].get, default=None)
    best_evasion_model = max(accuracy_results['Evasion Accuracy'], key=accuracy_results['Evasion Accuracy'].get, default=None)
    best_topic_model = max(accuracy_results['Topic Accuracy'], key=accuracy_results['Topic Accuracy'].get, default=None)

    # Plot confusion matrix for best Topic Classification model
    a2 = 0
    if 'True Topic' in df.columns and best_topic_model in df.columns and topic_flag == 'Y':
      cm_best_topics = confusion_matrix(df['True Topic'], df[best_topic_model], labels=df['True Topic'].unique())
      sns.heatmap(cm_best_topics, annot=True, fmt='d', cmap='Greens', ax=axes[1, a2], xticklabels=df['True Topic'].unique(), yticklabels=df['True Topic'].unique(), annot_kws={"size": 16})
      axes[1, a2].set_title(f'{best_topic_model.split()[0]}: Best Model\nTopic Classification', fontsize=14)
      axes[1, a2].set_xlabel('Predicted', fontsize=14)
      axes[1, a2].set_ylabel('True', fontsize=14)
      axes[1, a2].tick_params(axis='y', labelsize=12)
      axes[1, a2].tick_params(axis='x', labelsize=12)
    else:
      axes[1, a2].axis('off')

    # Plot confusion matrix for best Sentiment model
    a2 = 1
    if 'True Sentiment' in df.columns and best_sentiment_model in df.columns and sentiment_flag == 'Y':
      cm_best_sentiment = confusion_matrix(df['True Sentiment'], df[best_sentiment_model], labels=df['True Sentiment'].unique())
      sns.heatmap(cm_best_sentiment, annot=True, fmt='d', cmap='Blues', ax=axes[1, a2], xticklabels=df['True Sentiment'].unique(), yticklabels=df['True Sentiment'].unique(), annot_kws={"size": 16})
      axes[1, a2].set_title(f'{best_sentiment_model.split()[0]}: Best Model\nSentiment Analysis', fontsize=14)
      axes[1, a2].set_xlabel('Predicted', fontsize=14)
      axes[1, a2].set_ylabel('True', fontsize=14)
      axes[1, a2].tick_params(axis='y', labelsize=12)
      axes[1, a2].tick_params(axis='x', labelsize=12)
    else:
      axes[1, a2].axis('off')

    # Plot confusion matrix for best Evasion model
    a2 = 2
    if 'True Evasion Present' in df.columns and best_evasion_model in df.columns and evasion_flag == 'Y':
      cm_best_evasion = confusion_matrix(df['True Evasion Present'], df[best_evasion_model], labels=df['True Evasion Present'].unique())
      sns.heatmap(cm_best_evasion, annot=True, fmt='d', cmap='Reds', ax=axes[1, a2], xticklabels=df['True Evasion Present'].unique(), yticklabels=df['True Evasion Present'].unique(), annot_kws={"size": 16})
      axes[1, a2].set_title(f'{best_evasion_model.split()[0]}: Best Model\nQuestion Evasion', fontsize=14)
      axes[1, a2].set_xlabel('Predicted', fontsize=14)
      axes[1, a2].set_ylabel('True', fontsize=14)
      axes[1, a2].tick_params(axis='y', labelsize=12)
      axes[1, a2].tick_params(axis='x', labelsize=12)
    else:
      axes[1, a2].axis('off')




    plt.tight_layout()
    plt.show()

2.4.2 QA Evasion, Sentiment Analysis and Topic Modelling on Synthetic Dataset¶

2.4.2.0 Preparing synthetic data tables¶

2.4.2.0.1 Running RoBERTa on the synthetic dataset¶
In [ ]:
def get_synthetic(data_folder):
  data = pd.read_csv(data_folder + "Synthetic_Data.csv")
  data['true_sentiment'] = data['True Sentiment'].str.lower()
  return data

# load the synthetic dataset
synthetic_data = get_synthetic(processed_data_folder)
synthetic_data['true_score'] = synthetic_data['true_sentiment'].map(score_dict)

# get response from RoBERTa
roberta_model = ClassificationModel("financial-roberta-large", "soleimanian/financial-roberta-large-sentiment")
synthetic_data = roberta_model.get_model_response_for_df(synthetic_data, 'qa_text')
synthetic_data.rename(columns={'financial-roberta-large_sentiment':'RoBERTa Sentiment'}, inplace=True)
synthetic_data = synthetic_data[['uid', 'RoBERTa Sentiment']]

# save the result
synthetic_data.to_csv(output_data_folder + '/sentiment_eval_result_synthetic.csv')
2.4.2.0.2 Running finBERT on the synthetic dataset¶
In [ ]:
def preprocess_synthetic_data(data_folder):
    df_synthetic = pd.read_csv(data_folder + "/Synthetic_Data.csv")

    df_synthetic_processed = pd.DataFrame(columns = ['uid', 'qa_type', 'qa_num', 'qa_text'])
    for row in df_synthetic.itertuples():
        q, a = row.qa_text.split("?", 1)  # split at the first '?' only, in case the answer contains one
        q += '?'
        rows_to_append = pd.DataFrame({
            'uid': [row.uid+'_Q', row.uid+'_A'],
            'qa_type': ['Q','A'],
            'qa_num': [row.qa_num]*2,
            'qa_text': [q, a]
        })
        df_synthetic_processed = pd.concat([df_synthetic_processed, rows_to_append], ignore_index=True)

    save_path = os.path.join(data_folder, "synthetic_data_for_finbert.csv")
    df_synthetic_processed.to_csv(save_path, index=False)
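The row expansion performed above can be sketched as a standalone helper (the example text is made up); splitting at the first '?' only means a question mark in the answer does not break the parse:

```python
# Expand one aggregated Q&A string into separate Q and A rows
def split_qa(uid, qa_num, qa_text):
    q, a = qa_text.split("?", 1)  # split at the first '?' only
    q += "?"
    return [
        {"uid": f"{uid}_Q", "qa_type": "Q", "qa_num": qa_num, "qa_text": q},
        {"uid": f"{uid}_A", "qa_type": "A", "qa_num": qa_num, "qa_text": a},
    ]

rows = split_qa("synthetic_1", 1, "What drove loan growth? Mortgage demand rose.")
print(rows[0]["qa_text"])  # -> What drove loan growth?
```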
In [ ]:
preprocess_synthetic_data(processed_data_folder)
In [ ]:
process_finbert(
        finbert_tokeniser=finbert_topic_tokeniser,
        finbert_model=finbert_topic_model,
        save_folder=output_data_folder,
        label_dict=id2label,
        data_folder=processed_data_folder,
        chunking=False,
        max_length=512,
        qa_only=True,
        synthetic=True,
        appdx='_synthetic'
        )
In [ ]:
# Compute question evasion
plot_answers_to_questions(finbert_folder=output_data_folder,
                          data_folder=processed_data_folder,
                          label_dict=id2label,
                          chunking=False,
                          max_length=512,
                          bank="JPMorgan",
                          datatype="QA",
                          appdx="_synthetic",
                          synthetic=True,
                          save=True)


2.4.2.0.3 Create synthetic data table¶
In [ ]:
# Load Synthetic data files (change to filepaths in Inputs/Outputs folder later)
synthetic_QA_df = pd.read_excel(processed_data_folder + "Synthetic_QA_dataset.xlsx")

# Get RoBERTa synthetic output file (saved to the output folder in 2.4.2.0.1)
synthetic_Roberta_result = pd.read_csv(output_data_folder + "/sentiment_eval_result_synthetic.csv")

# Get finBERT synthetic output files
synthetic_finBERT_result_corr = pd.read_csv(output_data_folder + "finbert_QA_topic_correlations_synthetic.csv")
synthetic_finBERT_result_topics = pd.read_csv(output_data_folder + "finbert_topics_JPMorgan_chunkingFalse_maxlength512_synthetic.csv")
In [ ]:
# Merge RoBERTa Sentiment column into Synthetic results table
synthetic_QA_df = pd.merge(synthetic_QA_df, synthetic_Roberta_result[['uid', 'RoBERTa Sentiment']], on='uid', how='left')

# Drop _Q entries in uid column from synthetic_finBERT_result_topics and remove _A to match uid in other tables
synthetic_finBERT_result_topics = (
    synthetic_finBERT_result_topics[~synthetic_finBERT_result_topics['uid'].str.contains('_Q')]
    .assign(uid=lambda x: x['uid'].str.replace('_A', '', regex=False))
)
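The uid clean-up above (drop the '_Q' question rows, then strip the '_A' suffix so answer rows join back to the aggregated table) is equivalent to this plain-Python sketch:

```python
# Keep answer rows only and normalise their uids back to the aggregated form
def normalise_uids(uids):
    return [u.replace("_A", "") for u in uids if "_Q" not in u]

print(normalise_uids(["synthetic_1_Q", "synthetic_1_A", "synthetic_2_A"]))
# -> ['synthetic_1', 'synthetic_2']
```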

# Merge specific columns from finBERT tables into Synthetic results table
synthetic_QA_df = pd.merge(synthetic_QA_df, synthetic_finBERT_result_corr[['uid', 'qa_corr']], on='uid', how='left')
synthetic_QA_df = pd.merge(synthetic_QA_df, synthetic_finBERT_result_topics[['uid', 'finbert_topic_label']], on='uid', how='left')
synthetic_QA_df.rename(columns={'finbert_topic_label': 'finBERT Topic Classification'}, inplace=True)

# Calculate evasion for finBERT on synthetic data (also re-sets index to uid)
synthetic_QA_df = calculate_finBERT_evasion(synthetic_QA_df)

2.4.2.1 Running Phi-3.5 on the synthetic dataset¶

In [ ]:
"""
  Assess accuracy of Phi3.5 with provided prompt against GPT4-generated synthetic dataset,
  with benchmarking against 'True' human-confirmed labels and finBERT/RoBERTa.

A small synthetic financial Q&A dataset with combined/aggregated questions and answers in a 'qa_text' column
Question-answer pairs were generated by manual adaption of GPT-4o-mini responses
  4x Positive sentiment with: No Evasion, Low Evasion, Moderate Evasion, High Evasion
  4x Negative sentiment with: No Evasion, Low Evasion, Moderate Evasion, High Evasion
  4x Neutral sentiment with: No Evasion, Low Evasion, Moderate Evasion, High Evasion
All were checked manually and wording adapted where required to conform to human-accurate sentiment and evasion categorisation

This dataset (synthetic_QA_df) is designed to assess Phi3.5 accuracy against human-labelled data. It has first been run through RoBERTa and finBERT to create a benchmarking set.
"""

# Run Phi-3.5 function on synthetic data for classification into finBERT topic categories
print("Running Text Classification analysis")
synthetic_QA_df_topics = phi_question_answer(synthetic_QA_df, finBERT_topic_questions, input_aggregated='Y', input_col='qa_text')

# Run Phi-3.5 function on synthetic data for evasion
print("Running Evasion analysis")
synthetic_QA_df_evasion = phi_question_answer(synthetic_QA_df, evasion_questions, input_aggregated='Y', input_col='qa_text')

## Run Phi-3.5 function on synthetic data for sentiment
print("Running Sentiment analysis")
synthetic_QA_df_sentiment = phi_question_answer(synthetic_QA_df, sentiment_questions, input_aggregated='Y', input_col='qa_text')
Running Text Classification analysis
input_aggregated: Y
1/12 ...
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
`get_max_cache()` is deprecated for all Cache classes. Use `get_max_cache_shape()` instead. Calling `get_max_cache()` will raise error from v4.48
WARNING:transformers_modules.microsoft.Phi-3.5-mini-instruct.af0dfb8029e8a74545d0736d30cb6b58d2f0f3f0.modeling_phi3:You are not running the flash-attention implementation, expect numerical differences.
5/12 ...
10/12 ...
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
qa_type qa_num qa_text True Sentiment True Evasion Present True Topic Classification RoBERTa Sentiment finBERT Evasion Present finBERT Topic Classification keywords Phi-3.5 Topic Classification
uid
synthetic_1 QA 1 What is the impact of the new marketing strate... Positive Not Evasive Macro Positive Evasive Macro None Markets
synthetic_2 QA 2 How do you view the impact of the current inte... Positive Evasive Fed | Central Banks Positive Evasive Fed | Central Banks None Fed | Central Banks
synthetic_3 QA 3 Can you explain the benefits of the new financ... Positive Evasive Company | Product News Positive Evasive Company | Product News None Company | Product News
synthetic_4 QA 4 What measures are being implemented to improve... Positive Not Evasive Company | Product News Positive Evasive General News | Opinion None Risk Management
synthetic_5 QA 5 What are the recent trends in loan default rat... Neutral Not Evasive Financials Neutral Evasive Treasuries | Corporate Debt None General News
Time taken for 12 QA: 1.37 minutes
Estimate for all transcripts: 0.71 hours
Running Evasion analysis
input_aggregated: Y
1/12 ...
5/12 ...
10/12 ...
qa_type qa_num qa_text True Sentiment True Evasion Present True Topic Classification RoBERTa Sentiment finBERT Evasion Present finBERT Topic Classification keywords Phi-3.5 Evasion Present Phi-3.5 Evasion Degree Phi-3.5 Evaded Topics
uid
synthetic_1 QA 1 What is the impact of the new marketing strate... Positive Not Evasive Macro Positive Evasive Macro None Evasive Moderate loan growth projections, marketing strategy, u...
synthetic_2 QA 2 How do you view the impact of the current inte... Positive Evasive Fed | Central Banks Positive Evasive Fed | Central Banks None Evasive Moderate strategy details, cost mitigation, impact anal...
synthetic_3 QA 3 Can you explain the benefits of the new financ... Positive Evasive Company | Product News Positive Evasive Company | Product News None Evasive Moderate Strategy Details, Impact Analysis, Timeline, C...
synthetic_4 QA 4 What measures are being implemented to improve... Positive Not Evasive Company | Product News Positive Evasive General News | Opinion None Evasive Moderate Strategy Details, Impact Analysis, Cost Mitiga...
synthetic_5 QA 5 What are the recent trends in loan default rat... Neutral Not Evasive Financials Neutral Evasive Treasuries | Corporate Debt None Not Evasive Low Interest rate impact, Net interest margin, Ban...
Time taken for 12 QA: 1.38 minutes
Estimate for all transcripts: 0.72 hours
Running Sentiment analysis
input_aggregated: Y
1/12 ...
5/12 ...
10/12 ...
qa_type qa_num qa_text True Sentiment True Evasion Present True Topic Classification RoBERTa Sentiment finBERT Evasion Present finBERT Topic Classification keywords Phi-3.5 Sentiment Phi-3.5 Positive % Phi-3.5 Negative % Phi-3.5 Neutral % Phi-3.5 Positive Topics Phi-3.5 Negative Topics Phi-3.5 Neutral Topics
uid
synthetic_1 QA 1 What is the impact of the new marketing strate... Positive Not Evasive Macro Positive Evasive Macro None Positive 100.0 0.0 0.0 loan growth, marketing strategy, increased pro...
synthetic_2 QA 2 How do you view the impact of the current inte... Positive Evasive Fed | Central Banks Positive Evasive Fed | Central Banks None Positive 90.0 10.0 0.0 interest rates, lending capabilities, economic...
synthetic_3 QA 3 Can you explain the benefits of the new financ... Positive Evasive Company | Product News Positive Evasive Company | Product News None Positive 100.0 0.0 0.0 competitive rates, innovative features, market...
synthetic_4 QA 4 What measures are being implemented to improve... Positive Not Evasive Company | Product News Positive Evasive General News | Opinion None Positive 70.0 0.0 30.0 Risk Management, Real-Time Data, Mitigation, A... Financial Context, Measures, Implementation, M...
synthetic_5 QA 5 What are the recent trends in loan default rat... Neutral Not Evasive Financials Neutral Evasive Treasuries | Corporate Debt None Neutral 0.0 0.0 100.0 Stability, No Increase, No Decrease Loan Default Rates, Recent Trends, Financial S...
Time taken for 12 QA: 1.38 minutes
Estimate for all transcripts: 0.72 hours
In [ ]:
# Merge unique columns in synthetic tables
unique_columns_in_topics = [
    col for col in synthetic_QA_df_topics.columns if col not in synthetic_QA_df_evasion.columns
]

unique_columns_in_sentiment = [
    col for col in synthetic_QA_df_sentiment.columns if col not in synthetic_QA_df_evasion.columns
]

synthetic_QA_df_results = pd.merge(synthetic_QA_df_evasion, synthetic_QA_df_topics[unique_columns_in_topics], left_index=True, right_on="uid", how="outer")
synthetic_QA_df_results = pd.merge(synthetic_QA_df_results, synthetic_QA_df_sentiment[unique_columns_in_sentiment], left_index=True, right_on="uid", how="outer")

synthetic_QA_df_results.head()
Out[ ]:
qa_type qa_num qa_text True Sentiment True Evasion Present True Topic Classification RoBERTa Sentiment finBERT Evasion Present finBERT Topic Classification keywords ... Phi-3.5 Evasion Degree Phi-3.5 Evaded Topics Phi-3.5 Topic Classification Phi-3.5 Sentiment Phi-3.5 Positive % Phi-3.5 Negative % Phi-3.5 Neutral % Phi-3.5 Positive Topics Phi-3.5 Negative Topics Phi-3.5 Neutral Topics
uid
synthetic_1 QA 1 What is the impact of the new marketing strate... Positive Not Evasive Macro Positive Evasive Macro None ... Moderate loan growth projections, marketing strategy, u... Markets Positive 100.0 0.0 0.0 loan growth, marketing strategy, increased pro...
synthetic_10 QA 10 What is the bank's strategy for addressing ris... Negative Evasive Financials Negative Evasive General News | Opinion None ... Moderate loan defaults strategy, specifics, deteriorati... General News Negative 0.0 100.0 0.0 N/A loan defaults, strategy, deteriorating, non-di... context, bank, addressing, rising
synthetic_11 QA 11 What are the main challenges facing our loan p... Negative Evasive Macro Negative Evasive Macro None ... Moderate rising defaults, economic instability, loan po... Loan Portfolio Challenges Negative 0.0 100.0 0.0 N/A loan defaults, economic instability, portfolio... strategies, review, complexities
synthetic_12 QA 12 What steps are being taken to enhance credit r... Negative Not Evasive General News | Opinion Negative Evasive General News | Opinion None ... Moderate strategy details, economic decline, future def... Credit Risk Management Negative 10.0 85.0 5.0 stricter criteria, enhancing strategies, credi... struggling, economic decline, inadequate effor... context, steps, taken
synthetic_2 QA 2 How do you view the impact of the current inte... Positive Evasive Fed | Central Banks Positive Evasive Fed | Central Banks None ... Moderate strategy details, cost mitigation, impact anal... Fed | Central Banks Positive 90.0 10.0 0.0 interest rates, lending capabilities, economic...

5 rows × 21 columns

In [ ]:
synthetic_QA_df_results.to_csv(output_data_folder + 'phi_synthetic_QA_results.csv')

2.4.2.2 Compare results with other models¶

In [ ]:
# Calculate accuracies of Overall Sentiment, Topic and Question Evasion results
accuracy_results = calculate_accuracy(synthetic_QA_df_results)
Topic Accuracy Results (Significance Threshold 5.0%):
finBERT Topic Classification: 41.67%
Phi-3.5 Topic Classification: 33.33%

Sentiment Accuracy Results (Significance Threshold 33.33%):
RoBERTa Sentiment: 100.0%
Phi-3.5 Sentiment: 83.33%

Evasion Accuracy Results (Significance Threshold 50.0%):
finBERT Evasion Present: 41.67%
Phi-3.5 Evasion Present: 66.67%
[Figure: accuracy comparison bar chart for the synthetic data set]
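The helper `calculate_accuracy` is defined earlier in the notebook; as a reminder of what the figures above represent, here is a minimal, self-contained sketch of the comparison, assuming the function simply matches each model column against its "True ..." counterpart. The significance thresholds are chance-level baselines (1/20 topics = 5%, 1/3 sentiments = 33.33%, 1/2 evasion labels = 50%); `column_accuracy` and the demo frame are illustrative, not the notebook's actual implementation.

```python
import pandas as pd

def column_accuracy(df: pd.DataFrame, true_col: str, pred_col: str) -> float:
    """Percentage of rows where the model prediction matches the ground truth label."""
    valid = df[[true_col, pred_col]].dropna()  # ignore rows with missing labels
    return round(100 * (valid[true_col] == valid[pred_col]).mean(), 2)

# Tiny illustrative frame in the shape of the synthetic results table
demo = pd.DataFrame({
    "True Sentiment": ["Positive", "Negative", "Neutral"],
    "RoBERTa Sentiment": ["Positive", "Negative", "Positive"],
})
acc = column_accuracy(demo, "True Sentiment", "RoBERTa Sentiment")
chance = round(100 / 3, 2)  # chance-level baseline for 3 sentiment classes
print(acc, acc > chance)  # 66.67 True
```

An accuracy is only meaningful here when it clears the chance-level baseline for its label set, which is why each result block above quotes its own threshold.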

SYNTHETIC DATA RESULTS

Question Evasion

Initial results showed that Phi-3.5 was over-labelling answers as Evasive (5 of 6 Not Evasive answers were categorised as Evasive), which led to an initial conclusion of poor performance.

The prompt wording was then adapted to improve the results, e.g. by adding examples of Not Evasive answers. This approach was successful and made the model more accurate (results above).
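The exact prompt lives inside `phi_question_answer` and is not reproduced here; a hypothetical sketch of the few-shot adjustment described above might look like the following (all wording is illustrative, not the prompt actually used):

```python
# Hypothetical few-shot evasion prompt: anchoring the model with an explicit
# "Not Evasive" example counteracts its bias towards labelling answers Evasive.
evasion_prompt = """You are assessing whether an executive's answer evades the analyst's question.
Label the answer 'Evasive' or 'Not Evasive'.

Example (Not Evasive):
Q: What was the driver of NII growth this quarter?
A: Higher rates on the card book added roughly $300m; deposit margins added the rest.
Label: Not Evasive

Example (Evasive):
Q: Will you raise guidance for next year?
A: We feel good about the franchise and will update you in due course.
Label: Evasive

Now label the following exchange:
Q: {question}
A: {answer}
Label:"""

print(evasion_prompt.format(question="How are loan defaults trending?",
                            answer="Broadly stable quarter-on-quarter."))
```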

When the prompt for Phi-3.5 focussed on the directness of the answer, the results were similar to finBERT's, perhaps suggesting that Phi was interpreting directness in terms of the question topics covered. This 'direct' prompt was removed and is not shown here.

Note that Gemini was also tested on this dataset but was excluded from the analysis as it gave very poor Evasion results (likely because its prompt was not optimised). This illustrated the importance of prompt optimisation for such tasks, but optimising it further was outside the scope of this project. Therefore only GPT-4 was used for benchmarking accuracy.

Phi-3.5 outperformed finBERT for Evasion analysis. This is as might be expected, since finBERT was only using the topic-classification correlation between question and answer as a proxy for Evasion.
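A minimal sketch of that proxy, assuming the finBERT pipeline scores each text against the topic labels and flags an answer as Evasive when its topic profile correlates poorly with the question's (the function name and 0.5 threshold below are illustrative, not the notebook's actual implementation):

```python
import numpy as np

def evasion_from_topic_corr(q_scores: np.ndarray, a_scores: np.ndarray,
                            threshold: float = 0.5) -> str:
    """Label an answer Evasive when its topic score profile diverges from the question's."""
    corr = np.corrcoef(q_scores, a_scores)[0, 1]
    return "Not Evasive" if corr >= threshold else "Evasive"

q = np.array([0.7, 0.2, 0.1])          # question mostly about topic 0
on_topic = np.array([0.6, 0.3, 0.1])   # answer stays on topic 0
off_topic = np.array([0.1, 0.2, 0.7])  # answer drifts to topic 2
print(evasion_from_topic_corr(q, on_topic))   # Not Evasive
print(evasion_from_topic_corr(q, off_topic))  # Evasive
```

This also explains the failure mode: an answer can address the question directly while shifting topic (and so be flagged Evasive), which Phi-3.5's direct reading of the exchange avoids.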

Sentiment Analysis

RoBERTa outperformed Phi-3.5 for Sentiment categorisation, though Phi-3.5 still gave good results.

Topic Classification

finBERT outperformed Phi-3.5 for Topic classification. Again, this is not surprising, as finBERT is specifically designed for use with financial text.

2.4.3 QA Evasion, Sentiment Analysis and Topic Modelling on Ground Truth Data set¶

2.4.3.0 Preparing ground truth data tables¶

In [ ]:
def merge_datasets_for_plotting(finbert_folder, data_folder,
                                chunking=True, max_length=512,
                                bank="JPMorgan", appdx=""):
    """Merge the Q&A topic correlations into the finBERT ground truth results."""
    path_qa_corr = os.path.join(finbert_folder, f"finbert_QA_topic_correlations{appdx}.csv")
    df_qa_corr = pd.read_csv(path_qa_corr)

    path_finbert_ground_truth = os.path.join(data_folder, f"finbert_topics_ground_truth_{bank}_chunking{chunking}_maxlength{max_length}_QA{appdx}.csv")
    df_finbert_ground_truth = pd.read_csv(path_finbert_ground_truth)

    merged = df_finbert_ground_truth.merge(df_qa_corr, on='uid')
    # Save under the answers-only filename (no "_QA") so the input file is not overwritten
    path_finbert_ground_truth_answers_only = os.path.join(data_folder, f"finbert_topics_ground_truth_{bank}_chunking{chunking}_maxlength{max_length}{appdx}.csv")
    merged.to_csv(path_finbert_ground_truth_answers_only, index=False)
In [ ]:
# Load ground truth files (change to filepaths in Inputs/Outputs folders later)
ground_truth_df = pd.read_excel(processed_data_folder + "ground_truth_JPMorgan_manual.xlsx")

# Get RoBERTa ground truth output file
ground_truth_Roberta_result = pd.read_csv(output_data_folder + "sentiment_eval_result_ground_truth.csv")


# merging Q/A evasion correlations with Finbert + ground truth output
merge_datasets_for_plotting(output_data_folder, processed_data_folder,
                                chunking=False, max_length=512,
                                bank="JPMorgan", appdx="")

merge_datasets_for_plotting(output_data_folder, processed_data_folder,
                                chunking=False, max_length=512,
                                bank="JPMorgan", appdx="_summarised")

# Get finBERT ground truth output files for correlation scores
ground_truth_finBERT_result_corr = pd.read_csv(output_data_folder + "finbert_topics_ground_truth_JPMorgan_chunkingFalse_maxlength512.csv")
ground_truth_finBERT_result_corr = ground_truth_finBERT_result_corr[['uid','qa_corr']]
ground_truth_finBERT_result_corr_sum = pd.read_csv(output_data_folder + "finbert_topics_ground_truth_JPMorgan_chunkingFalse_maxlength512_summarised.csv")
ground_truth_finBERT_result_corr_sum = ground_truth_finBERT_result_corr_sum[['uid','qa_corr']]

# Get finBERT ground truth output files for Topics
ground_truth_finBERT_result_topics = pd.read_csv(output_data_folder + "finbert_topics_ground_truth_JPMorgan_chunkingFalse_maxlength512_QA.csv")
ground_truth_finBERT_result_topics_sum = pd.read_csv(output_data_folder + "finbert_topics_ground_truth_JPMorgan_chunkingFalse_maxlength512_QA_summarised.csv")
In [ ]:
# Evasion and Topic Classification
# Re-name columns to distinguish columns between summarised and non-summarised results
ground_truth_finBERT_result_topics.rename(columns={'finbert_topic_label': 'finBERT Topic Classification'}, inplace=True)
ground_truth_finBERT_result_topics_sum.rename(columns={'finbert_topic_label': 'finBERT(summarised) Topic Classification'}, inplace=True)
ground_truth_finBERT_result_corr_sum.rename(columns={'qa_corr': 'qa_corr_summarised'}, inplace=True)

# Merge results columns from finBERT into ground truth table
ground_truth_df = pd.merge(ground_truth_df, ground_truth_finBERT_result_topics[['uid', 'finBERT Topic Classification']], on='uid', how='left')
ground_truth_df = pd.merge(ground_truth_df, ground_truth_finBERT_result_topics_sum[['uid', 'finBERT(summarised) Topic Classification']], on='uid', how='left')
ground_truth_df = pd.merge(ground_truth_df, ground_truth_finBERT_result_corr[['uid', 'qa_corr']], on='uid', how='left')
ground_truth_df = pd.merge(ground_truth_df, ground_truth_finBERT_result_corr_sum[['uid', 'qa_corr_summarised']], on='uid', how='left')

# Sentiment Analysis
# Merge results column from RoBERTa into ground truth table
ground_truth_df = pd.merge(ground_truth_df, ground_truth_Roberta_result[['uid', 'financial-roberta-large_sentiment']], on='uid', how='left')

# Capitalise each entry in sentiment column and re-name column
ground_truth_df['financial-roberta-large_sentiment'] = ground_truth_df['financial-roberta-large_sentiment'].str.capitalize()
ground_truth_df.rename(columns={'financial-roberta-large_sentiment': 'RoBERTa Sentiment'}, inplace=True)
In [ ]:
# Tidy up non-aggregated df column names and create an aggregated df with the combined Q&A on each row for evasion analysis.
# Includes a manual check on the aggregated data for duplicate entries caused by multi-part answers; these must be removed where possible for an accurate comparison.
ground_truth_df, ground_truth_df_agg = aggregate_test_data(ground_truth_df)
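`aggregate_test_data` is defined earlier in the notebook; as an illustration of what the aggregation step produces, a simplified sketch might pair each `_Q_` row with its `_A_` row and concatenate the text into a single `_QA_` row (`aggregate_qa` and the demo frame below are hypothetical, not the notebook's implementation):

```python
import pandas as pd

def aggregate_qa(df: pd.DataFrame) -> pd.DataFrame:
    """Combine matching question (_Q_) and answer (_A_) rows into one _QA_ row."""
    df = df.copy()
    # Map both uid variants onto a shared _QA_ key
    df["pair"] = df["uid"].str.replace("_Q_", "_QA_").str.replace("_A_", "_QA_")
    df["order"] = df["uid"].str.contains("_A_").astype(int)  # questions sort first
    agg = (df.sort_values(["pair", "order"])
             .groupby("pair", as_index=False)
             .agg(qa_text=("qa_text", " ".join)))
    return agg.rename(columns={"pair": "uid"})

demo = pd.DataFrame({
    "uid": ["Bank_1Q23_Q_1", "Bank_1Q23_A_1"],
    "qa_text": ["What drove NII?", "Mostly rates."],
})
print(aggregate_qa(demo))
```

Keeping both the combined Q&A text and the per-row labels is what later allows the evasion prompt to see question and answer together while sentiment is still assessed per row.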
In [ ]:
# Resolve multi-result fields in the aggregated ground truth results: replace each with the result for the longest text string in qa_text
# List of prefixes to search for in the 'uid' column
prefixes = [
    "JPMorganChase_4Q23_A_30",
    "JPMorganChase_3Q23_A_11",
    "JPMorganChase_2Q23_A_23",
    "JPMorganChase_2Q23_A_43",
    "JPMorganChase_1Q23_A_42",
    "JPMorganChase_3Q22_A_29",
    "JPMorganChase_2Q22_A_1",
    "JPMorganChase_2Q22_A_21",
    "JPMorganChase_2Q22_A_41",
    "JPMorganChase_4Q21_A_41",
    "JPMorganChase_4Q21_A_43",
    "JPMorganChase_3Q21_A_8",
    "JPMorganChase_2Q21_A_24",
    "JPMorganChase_2Q21_A_37"
]

# Loop through each prefix and filter rows in the non-aggregated DataFrame
for prefix in prefixes:
    filtered_rows_non_agg = ground_truth_df[ground_truth_df['uid'].str.contains(prefix, regex=True)]

    # Find the row with the longest 'qa_text' for each prefix
    if not filtered_rows_non_agg.empty:
        longest_text_row = filtered_rows_non_agg.loc[filtered_rows_non_agg['qa_text'].str.len().idxmax()]
        updated_uid = prefix.replace('_A_', '_QA_') + ".0"  # all listed prefixes use the _A_ form

        # Update the ground_truth_df_agg with values from the non-agg DataFrame
        if updated_uid in ground_truth_df_agg['uid'].values:

            ground_truth_df_agg.loc[
                ground_truth_df_agg['uid'] == updated_uid,
                ['True Sentiment', 'RoBERTa Sentiment', 'True Topic', 'finBERT Topic Classification', 'finBERT(summarised) Topic Classification']
            ] = [
                longest_text_row['True Sentiment'],
                longest_text_row['RoBERTa Sentiment'],
                longest_text_row['True Topic'],
                longest_text_row['finBERT Topic Classification'],
                longest_text_row['finBERT(summarised) Topic Classification']
            ]
In [ ]:
# Calculate evasion for finBERT (also re-sets index to uid)
ground_truth_df_agg = calculate_finBERT_evasion(ground_truth_df_agg)
print("Aggregated ground truth dataframe")
display(ground_truth_df_agg.head())

ground_truth_df = calculate_finBERT_evasion(ground_truth_df)
print("Ground truth dataframe")
display(ground_truth_df.head())
Aggregated ground truth dataframe
qa_type qa_num qa_text True Sentiment True Topic True Evasion Present RoBERTa Sentiment finBERT Evasion Present finBERT Topic Classification finBERT(summarised) Evasion Present finBERT(summarised) Topic Classification
uid
JPMorganChase_3Q24_QA_28.0 QA 28 And so Daniel's comments in September were on ... Negative Earnings Not Evasive Negative Evasive Financials Evasive Financials
JPMorganChase_2Q24_QA_26.0 QA 62 Very good. And as a follow-up, you've been ver... Positive Financials Not Evasive Positive Evasive General News | Opinion Evasive Macro
JPMorganChase_1Q24_QA_30.0 QA 97 Thank you. And I guess, as a tie-in to that qu... Positive M&A | Investments Not Evasive Neutral Evasive General News | Opinion Evasive Company | Product News
JPMorganChase_1Q24_QA_33.0 QA 100 As a quick follow-up. Where are the next homer... Negative General News | Opinion Not Evasive Neutral Evasive General News | Opinion Evasive Company | Product News
JPMorganChase_4Q23_QA_30.0 QA 131 Just to make sure I understood what you're say... Negative Earnings Evasive Neutral Evasive General News | Opinion Evasive Financials
Ground truth dataframe
qa_type qa_num qa_text True Sentiment True Topic True Evasion Present RoBERTa Sentiment finBERT Evasion Present finBERT(summarised) Evasion Present finBERT Topic Classification finBERT(summarised) Topic Classification
uid
JPMorganChase_3Q24_Q_28.0 Q 28 And so Daniel's comments in September were on ... Neutral Earnings Not Evasive Neutral Evasive Evasive General News | Opinion Fed | Central Banks
JPMorganChase_3Q24_A_28.0 A 28 No. Those were core NII or NII ex.. So again, ... Negative Earnings NaN Negative Evasive Evasive Financials Financials
JPMorganChase_2Q24_Q_26.0 Q 62 Very good. And as a follow-up, you've been ver... Neutral Financials Not Evasive Neutral Evasive Evasive General News | Opinion General News | Opinion
JPMorganChase_2Q24_A_26.0 A 62 Yeah. It's a good question. I think the short ... Positive Financials NaN Positive Evasive Evasive General News | Opinion Macro
JPMorganChase_1Q24_Q_30.0 Q 97 Thank you. And I guess, as a tie-in to that qu... Neutral M&A | Investments Not Evasive Neutral Evasive Evasive General News | Opinion General News | Opinion

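`calculate_finBERT_evasion` is defined earlier in the notebook; a hedged sketch of what it plausibly does, assuming it thresholds the finBERT question/answer topic correlation (`qa_corr`) and re-indexes on `uid` (the 0.5 cut-off and the name `finbert_evasion_sketch` are illustrative only):

```python
import pandas as pd

def finbert_evasion_sketch(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Label each Q&A pair Evasive when its topic correlation falls below the cut-off."""
    df = df.copy()
    df["finBERT Evasion Present"] = df["qa_corr"].apply(
        lambda c: "Evasive" if pd.notna(c) and c < threshold else "Not Evasive"
    )
    return df.set_index("uid")  # match the uid-indexed tables shown above

demo = pd.DataFrame({"uid": ["x_QA_1.0", "x_QA_2.0"], "qa_corr": [0.8, 0.1]})
print(finbert_evasion_sketch(demo)["finBERT Evasion Present"].tolist())
```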
2.4.3.1 Running Phi-3.5 on ground truth data set¶

In [ ]:
"""Aggregated ground truth dataset - Evasion (aggregated ground truth)
    Evasion for Phi must be run on an aggregated dataset, otherwise it will ignore the question.
"""
# Run Phi-3.5 function for Evasion on aggregated data
ground_truth_df_evasion = phi_question_answer(ground_truth_df_agg, evasion_questions, input_aggregated='Y', input_col='qa_text')
input_aggregated: Y
1/42 ...
Keywords found: Basel III
5/42 ...
10/42 ...
15/42 ...
20/42 ...
25/42 ...
30/42 ...
35/42 ...
40/42 ...
Keywords found: Basel III
qa_type qa_num qa_text True Sentiment True Topic True Evasion Present RoBERTa Sentiment finBERT Evasion Present finBERT Topic Classification finBERT(summarised) Evasion Present finBERT(summarised) Topic Classification keywords Phi-3.5 Evasion Present Phi-3.5 Evasion Degree Phi-3.5 Evaded Topics
uid
JPMorganChase_3Q24_QA_28.0 QA 28 And so Daniel's comments in September were on ... Negative Earnings Not Evasive Negative Evasive Financials Evasive Financials None Evasive Moderate 2024 consensus, 2025 consensus, NII ex. Market...
JPMorganChase_2Q24_QA_26.0 QA 62 Very good. And as a follow-up, you've been ver... Positive Financials Not Evasive Positive Evasive General News | Opinion Evasive Macro None Not Evasive Low C&I charge-off rate, Underwriting process, Cre...
JPMorganChase_1Q24_QA_30.0 QA 97 Thank you. And I guess, as a tie-in to that qu... Positive M&A | Investments Not Evasive Neutral Evasive General News | Opinion Evasive Company | Product News None Evasive Moderate company strategy, competitive dynamics, risk a...
JPMorganChase_1Q24_QA_33.0 QA 100 As a quick follow-up. Where are the next homer... Negative General News | Opinion Not Evasive Neutral Evasive General News | Opinion Evasive Company | Product News Basel III Evasive Moderate bank failures, private equity, family offices,...
JPMorganChase_4Q23_QA_30.0 QA 131 Just to make sure I understood what you're say... Negative Earnings Evasive Neutral Evasive General News | Opinion Evasive Financials None Evasive Moderate 2024 consensus, NII Markets, First Republic co...
Time taken for 42 QA: 4.9 minutes
Estimate for all transcripts: 0.73 hours
In [ ]:
""" Sentiment and Topic Analysis
    Non-aggregated ground truth dataset - Sentiment (non-aggregated ground truth) and Topics
    For true comparability assessment here against RoBERTa, Phi Sentiment analysis is run on non-aggregated set
    However, it is also run on aggregated set for assessment of the method of aggregating RoBERTa sentiments and finBERT topics for the QA_ground_truth set, which would be required to run the combined_questions prompt
"""
# Run Phi-3.5 function for Sentiment/ Topics
print("Sentiment analysis on Q&A ground truth table")
ground_truth_df_sentiment = phi_question_answer(ground_truth_df, sentiment_questions, input_aggregated='N', input_col='qa_text')
print("Topic analysis on Q&A ground truth table")
ground_truth_df_topics = phi_question_answer(ground_truth_df, finBERT_topic_questions, input_aggregated='N', input_col='qa_text')

print("Sentiment analysis on aggregated QA ground truth table")
ground_truth_df_sentiment_agg = phi_question_answer(ground_truth_df_agg, sentiment_questions, input_aggregated='Y', input_col='qa_text')
print("Topic analysis on aggregated QA ground truth table")
ground_truth_df_topics_agg = phi_question_answer(ground_truth_df_agg, finBERT_topic_questions, input_aggregated='Y', input_col='qa_text')
Sentiment analysis on Q&A ground truth table
input_aggregated: N
1/108 ...
5/108 ...
Keywords found: Basel III
10/108 ...
15/108 ...
20/108 ...
25/108 ...
30/108 ...
35/108 ...
40/108 ...
45/108 ...
50/108 ...
55/108 ...
60/108 ...
65/108 ...
70/108 ...
75/108 ...
80/108 ...
85/108 ...
90/108 ...
95/108 ...
100/108 ...
Keywords found: Basel III
105/108 ...
qa_type qa_num qa_text True Sentiment True Topic True Evasion Present RoBERTa Sentiment finBERT Evasion Present finBERT(summarised) Evasion Present finBERT Topic Classification finBERT(summarised) Topic Classification keywords Phi-3.5 Sentiment Phi-3.5 Positive % Phi-3.5 Negative % Phi-3.5 Neutral % Phi-3.5 Positive Topics Phi-3.5 Negative Topics Phi-3.5 Neutral Topics
uid
JPMorganChase_3Q24_Q_28.0 Q 28 And so Daniel's comments in September were on ... Neutral Earnings Not Evasive Neutral Evasive Evasive NaN NaN None Positive 60.0 20.0 20.0 NII growth, 2024 consensus, Asset-Sensitive Nu... Market volatility, Risk assessment, Economic u... Financial context, Daniel's comments, Market c...
JPMorganChase_3Q24_A_28.0 A 28 No. Those were core NII or NII ex.. So again, ... Negative Earnings NaN Negative Evasive Evasive Financials Financials None Negative 10.0 80.0 10.0 2025 consensus, NII, Asset-Sensitive Number, M... Insufficient Decline, Toppy Numbers, Sequentia... Financial Context, Core NII, Market Analysis, ...
JPMorganChase_2Q24_Q_26.0 Q 62 Very good. And as a follow-up, you've been ver... Neutral Financials Not Evasive Neutral Evasive Evasive NaN NaN None Positive 70.0 10.0 20.0 C&I portfolio, strong performance, elevated ra... non-accrual loans increase, potential cracks, ... context clarification, follow-up question, fin...
JPMorganChase_2Q24_A_26.0 A 62 Yeah. It's a good question. I think the short ... Positive Financials NaN Positive Evasive Evasive General News | Opinion Macro None Positive 70.0 0.0 30.0 C&I charge-off rate, credit culture, underwrit... current quarter's results, upward pressure, id...
JPMorganChase_1Q24_Q_30.0 Q 97 Thank you. And I guess, as a tie-in to that qu... Neutral M&A | Investments Not Evasive Neutral Evasive Evasive NaN NaN None Positive 60.0 20.0 20.0 JPMorgan, Private Credit Growth, Competitive E... Competition, Business Bidding, Private Credit ... Context, Question and Answer, Market Analysis
Time taken for 108 QA: 12.21 minutes
Estimate for all transcripts: 1.75 hours
Topic analysis on Q&A ground truth table
input_aggregated: N
1/108 ...
5/108 ...
Keywords found: Basel III
10/108 ...
15/108 ...
20/108 ...
25/108 ...
30/108 ...
35/108 ...
40/108 ...
45/108 ...
50/108 ...
55/108 ...
60/108 ...
65/108 ...
70/108 ...
75/108 ...
80/108 ...
85/108 ...
90/108 ...
95/108 ...
100/108 ...
Keywords found: Basel III
105/108 ...
qa_type qa_num qa_text True Sentiment True Topic True Evasion Present RoBERTa Sentiment finBERT Evasion Present finBERT(summarised) Evasion Present finBERT Topic Classification finBERT(summarised) Topic Classification keywords Phi-3.5 Topic Classification
uid
JPMorganChase_3Q24_Q_28.0 Q 28 And so Daniel's comments in September were on ... Neutral Earnings Not Evasive Neutral Evasive Evasive NaN NaN None Markets
JPMorganChase_3Q24_A_28.0 A 28 No. Those were core NII or NII ex.. So again, ... Negative Earnings NaN Negative Evasive Evasive Financials Financials None Financials
JPMorganChase_2Q24_Q_26.0 Q 62 Very good. And as a follow-up, you've been ver... Neutral Financials Not Evasive Neutral Evasive Evasive NaN NaN None Financials
JPMorganChase_2Q24_A_26.0 A 62 Yeah. It's a good question. I think the short ... Positive Financials NaN Positive Evasive Evasive General News | Opinion Macro None Earnings
JPMorganChase_1Q24_Q_30.0 Q 97 Thank you. And I guess, as a tie-in to that qu... Neutral M&A | Investments Not Evasive Neutral Evasive Evasive NaN NaN None General News
Time taken for 108 QA: 12.19 minutes
Estimate for all transcripts: 1.74 hours
Sentiment analysis on aggregated QA ground truth table
input_aggregated: Y
1/42 ...
Keywords found: Basel III
5/42 ...
10/42 ...
15/42 ...
20/42 ...
25/42 ...
30/42 ...
35/42 ...
40/42 ...
Keywords found: Basel III
qa_type qa_num qa_text True Sentiment True Topic True Evasion Present RoBERTa Sentiment finBERT Evasion Present finBERT Topic Classification finBERT(summarised) Evasion Present finBERT(summarised) Topic Classification keywords Phi-3.5 Sentiment Phi-3.5 Positive % Phi-3.5 Negative % Phi-3.5 Neutral % Phi-3.5 Positive Topics Phi-3.5 Negative Topics Phi-3.5 Neutral Topics
uid
JPMorganChase_3Q24_QA_28.0 QA 28 And so Daniel's comments in September were on ... Negative Earnings Not Evasive Negative Evasive Financials Evasive Financials None Negative 10.0 80.0 10.0 NII Ex. Markets, Consensus, Decline Year-on-Ye... Insufficient Decline, Toppy Consensus, Sequent... Financial Context, Core NII, 2024 Consensus, 2...
JPMorganChase_2Q24_QA_26.0 QA 62 Very good. And as a follow-up, you've been ver... Positive Financials Not Evasive Positive Evasive General News | Opinion Evasive Macro None Positive 80.0 20.0 0.0 C&I portfolio, strong performance, low charge-... non-accrual loans, potential cracks, upward pr... context, follow-up, future outlook
JPMorganChase_1Q24_QA_30.0 QA 97 Thank you. And I guess, as a tie-in to that qu... Positive M&A | Investments Not Evasive Neutral Evasive General News | Opinion Evasive Company | Product News None Positive 80.0 20.0 0.0 private credit growth, competitive position, v... competitive dynamics, concentration risks context, balance, tension
JPMorganChase_1Q24_QA_33.0 QA 100 As a quick follow-up. Where are the next homer... Negative General News | Opinion Not Evasive Neutral Evasive General News | Opinion Evasive Company | Product News Basel III Positive 60.0 20.0 20.0 capital deployment, earnings in store, well-po... bank failures, regional bank failures, competi... bank resolution process, FDIC's attitude, capi...
JPMorganChase_4Q23_QA_30.0 QA 131 Just to make sure I understood what you're say... Negative Earnings Evasive Neutral Evasive General News | Opinion Evasive Financials None Neutral 20.0 30.0 50.0 NII, 2024 consensus, First Republic, Calendari... NII pull to par, First Republic guidance, Fund... Market change, First Republic contribution, An...
Time taken for 42 QA: 4.84 minutes
Estimate for all transcripts: 0.72 hours
Topic analysis on aggregated QA ground truth table
input_aggregated: Y
1/42 ...
Keywords found: Basel III
5/42 ...
10/42 ...
15/42 ...
20/42 ...
25/42 ...
30/42 ...
35/42 ...
40/42 ...
Keywords found: Basel III
qa_type qa_num qa_text True Sentiment True Topic True Evasion Present RoBERTa Sentiment finBERT Evasion Present finBERT Topic Classification finBERT(summarised) Evasion Present finBERT(summarised) Topic Classification keywords Phi-3.5 Topic Classification
uid
JPMorganChase_3Q24_QA_28.0 QA 28 And so Daniel's comments in September were on ... Negative Earnings Not Evasive Negative Evasive Financials Evasive Financials None Financials
JPMorganChase_2Q24_QA_26.0 QA 62 Very good. And as a follow-up, you've been ver... Positive Financials Not Evasive Positive Evasive General News | Opinion Evasive Macro None Earnings
JPMorganChase_1Q24_QA_30.0 QA 97 Thank you. And I guess, as a tie-in to that qu... Positive M&A | Investments Not Evasive Neutral Evasive General News | Opinion Evasive Company | Product News None General News
JPMorganChase_1Q24_QA_33.0 QA 100 As a quick follow-up. Where are the next homer... Negative General News | Opinion Not Evasive Neutral Evasive General News | Opinion Evasive Company | Product News Basel III General News
JPMorganChase_4Q23_QA_30.0 QA 131 Just to make sure I understood what you're say... Negative Earnings Evasive Neutral Evasive General News | Opinion Evasive Financials None Financials
Time taken for 42 QA: 4.92 minutes
Estimate for all transcripts: 0.73 hours
In [ ]:
# Merge unique columns into single aggregated table result
unique_columns_in_topics_agg = [
    col for col in ground_truth_df_topics_agg.columns if col not in ground_truth_df_evasion.columns
]

unique_columns_in_sentiment_agg = [
    col for col in ground_truth_df_sentiment_agg.columns if col not in ground_truth_df_evasion.columns
]

ground_truth_df_all_results_agg = pd.merge(ground_truth_df_evasion, ground_truth_df_topics_agg[unique_columns_in_topics_agg], left_index=True, right_on="uid", how="outer")
ground_truth_df_all_results_agg = pd.merge(ground_truth_df_all_results_agg, ground_truth_df_sentiment_agg[unique_columns_in_sentiment_agg], left_index=True, right_on="uid", how="outer")
print("Aggregated data results table (all analyses on aggregated QA set):")
display(ground_truth_df_all_results_agg.head())


# Merge unique columns into single non-aggregated table result
# Note that finBERT classification only has results for answers
unique_columns_in_topics = [
    col for col in ground_truth_df_topics.columns if col not in ground_truth_df_sentiment.columns
]

ground_truth_df_all_results = pd.merge(ground_truth_df_sentiment, ground_truth_df_topics[unique_columns_in_topics], left_index=True, right_on="uid", how="outer")
print("Sentiment/Topics results table (all analyses on non-aggregated Q&A set):")
display(ground_truth_df_all_results.head())
Aggregated data results table (all analyses on aggregated QA set):
qa_type qa_num qa_text True Sentiment True Topic True Evasion Present RoBERTa Sentiment finBERT Evasion Present finBERT Topic Classification finBERT(summarised) Evasion Present ... Phi-3.5 Evasion Degree Phi-3.5 Evaded Topics Phi-3.5 Topic Classification Phi-3.5 Sentiment Phi-3.5 Positive % Phi-3.5 Negative % Phi-3.5 Neutral % Phi-3.5 Positive Topics Phi-3.5 Negative Topics Phi-3.5 Neutral Topics
uid
JPMorganChase_1Q22_QA_20.0 QA 408 That doesn’t sound so bad. Yeah. No. I mean, G... Positive Financials Evasive Neutral Not Evasive General News | Opinion Evasive ... Moderate 2024 consensus, NII Markets, Asset-Sensitive N... General News Positive 70.0 10.0 20.0 no concerns, spot metrics, not seeing anything... complacent, risks, around the corner context, Glenn, current situation, metrics, en...
JPMorganChase_1Q22_QA_38.0 QA 426 Okay. And then maybe just skipping over to tra... Neutral Markets Evasive Neutral Evasive General News | Opinion Evasive ... Moderate Trading performance prediction, March quarter ... Fed Positive 60.0 0.0 40.0 stronger quarter, trading, QT normalization, M... context, trading strategy, market expectations...
JPMorganChase_1Q22_QA_9.0 QA 397 Okay. And getting to double digits is over the... Neutral Company | Product News Not Evasive Neutral Evasive General News | Opinion Not Evasive ... Moderate Timeframe, Pace of growth, Specific growth rat... Financials Positive 70.0 0.0 30.0 double digits growth, 7% target, accelerate, I... context, timeframe, pace, financial update
JPMorganChase_1Q23_QA_18.0 QA 235 In your comments about your CET1 ratio, obviou... Negative Fed | Central Banks Not Evasive Negative Evasive Fed | Central Banks Evasive ... None N/A Financials Positive 60.0 20.0 20.0 CET1 ratio, GSIB buffer, Stress Capital Buffer... SCB prediction, Industry surprise, Credit shocks Operating assumptions, Planning purposes, SCB ...
JPMorganChase_1Q23_QA_19.0 QA 236 Sure. And then just as a follow-up, if I heard... Neutral Financials Not Evasive Neutral Evasive General News | Opinion Evasive ... Moderate Commercial Real Estate, Single Name Items, Cor... Earnings Positive 60.0 20.0 20.0 loan loss reserve build, one-off credits, Corp... larger credits, potential risk, single name items context clarification, follow-up question, fin...

5 rows × 23 columns

Sentiment/Topics results table (all analyses on non-aggregated Q&A set):
qa_type qa_num qa_text True Sentiment True Topic True Evasion Present RoBERTa Sentiment finBERT Evasion Present finBERT(summarised) Evasion Present finBERT Topic Classification finBERT(summarised) Topic Classification keywords Phi-3.5 Sentiment Phi-3.5 Positive % Phi-3.5 Negative % Phi-3.5 Neutral % Phi-3.5 Positive Topics Phi-3.5 Negative Topics Phi-3.5 Neutral Topics Phi-3.5 Topic Classification
uid
JPMorganChase_1Q22_A_20.0 A 408 Yeah. No. I mean, Glenn, I think, look, no one... Positive Financials NaN Neutral Not Evasive Evasive General News | Opinion Macro None Positive 80.0 0.0 20.0 no concerns, spot metrics, not seeing anything... complacent, environment, looking around the co... General News
JPMorganChase_1Q22_A_38.0 A 426 Yeah. I mean, you know that we're going to be ... Neutral Markets NaN Neutral Evasive Evasive General News | Opinion Macro None Negative 0.0 100.0 0.0 N/A Reluctance, Predicting, Trading Performance, U... Context, Financial Sentiment, Analysis, Foreca... General News
JPMorganChase_1Q22_A_9.0 A 397 I wasn't meaning to put a timeframe on it. But... Neutral Company | Product News NaN Neutral Evasive Not Evasive General News | Opinion Company | Product News None Positive 70.0 0.0 30.0 Investor Day, five-year plan, strategic growth... timeframe, update frequency, financial context General News
JPMorganChase_1Q22_Q_20.0 Q 408 That doesn’t sound so bad. Neutral General News | Opinion Evasive Neutral Evasive Evasive NaN NaN None Positive 70.0 0.0 30.0 Asset-Sensitive Number, 2024 Consensus, Market... Financial Analysis, Market Predictions, Econom... Earnings
JPMorganChase_1Q22_Q_38.0 Q 426 Okay. And then maybe just skipping over to tra... Neutral Markets Evasive Positive Evasive Evasive NaN NaN None Positive 70.0 10.0 20.0 Trading Strength, QT Normalization, Fed Volati... Fed Concerns, Market Volatility, QT Impact, In... Context Clarification, Market Analysis, Tradin... Fed
InΒ [Β ]:
ground_truth_df_all_results.to_csv(output_data_folder + 'phi_ground_truth_results.csv')
ground_truth_df_all_results_agg.to_csv(output_data_folder + 'phi_ground_truth_results_agg.csv')

2.4.3.2 Compare results with other models (Output)ΒΆ

InΒ [Β ]:
# Assess validity of aggregating RoBERTa and finBERT results (needed for direct comparison with Evasive results later)
print("Non-aggregated answers:")
accuracy_results_gt = calculate_accuracy(ground_truth_df_all_results, sentiment_flag='Y', evasion_flag='N', topic_flag='Y')
Non-aggregated answers:

Topic Accuracy Results (Significance Threshold 5.0%):
finBERT Topic Classification: 12.04%
finBERT(summarised) Topic Classification: 25.93%
Phi-3.5 Topic Classification: 25.0%

Sentiment Accuracy Results (Significance Threshold 33.33%):
RoBERTa Sentiment: 66.67%
Phi-3.5 Sentiment: 38.89%
No description has been provided for this image
InΒ [Β ]:
print("Aggregated answer Results:")
accuracy_results_gt_agg = calculate_accuracy(ground_truth_df_all_results_agg, sentiment_flag='Y', evasion_flag='Y', topic_flag='Y')
Aggregated answer Results:

Topic Accuracy Results (Significance Threshold 5.0%):
finBERT Topic Classification: 9.52%
finBERT(summarised) Topic Classification: 28.57%
Phi-3.5 Topic Classification: 23.81%

Sentiment Accuracy Results (Significance Threshold 33.33%):
RoBERTa Sentiment: 61.9%
Phi-3.5 Sentiment: 35.71%

Evasion Accuracy Results (Significance Threshold 50.0%):
finBERT Evasion Present: 40.48%
finBERT(summarised) Evasion Present: 50.0%
Phi-3.5 Evasion Present: 61.9%
No description has been provided for this image
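The significance thresholds quoted in the accuracy outputs (5.0%, 33.33%, 50.0%) appear to correspond to chance-level accuracy, i.e. 1/number of classes. A quick sanity check (the class counts of 20 topics, 3 sentiments, and 2 evasion labels are inferred from the thresholds, not stated explicitly):

```python
# Chance-level accuracy baselines; class counts are assumptions inferred
# from the thresholds reported above (5%, 33.33%, 50%)
n_classes = {"topic": 20, "sentiment": 3, "evasion": 2}
thresholds = {task: 100 / n for task, n in n_classes.items()}
for task, t in thresholds.items():
    print(f"{task}: {t:.2f}% chance baseline")
```

A model only adds value on a task when its accuracy clearly exceeds the corresponding baseline.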

Topic Classification

Aggregating answers improved finBERT performance on summarised texts, with the aggregated summarised-answer analysis giving the best accuracy (28.6%). (Where a question had more than one answer, finBERT results were aggregated by taking the topic class of the longest answer.) The higher accuracy on summarised data likely reflects Phi-3.5's removal of irrelevant (non-financial) language from the text, while aggregation provides more content per question and therefore more accurate topic classification.
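The longest-answer aggregation rule can be sketched as follows. This is a minimal illustration on toy data; the aggregation helper actually used in this notebook may differ in detail:

```python
import pandas as pd

# Toy frame: one question (qa_num 3) with two answers of different lengths
df = pd.DataFrame({
    "qa_num": [3, 3],
    "qa_text": ["Short reply.", "A much longer and more detailed reply."],
    "finBERT Topic Classification": ["Macro", "Earnings"],
})

def longest_answer_label(group, label_col):
    # Keep the classification assigned to the longest answer in the group
    idx = group["qa_text"].str.len().idxmax()
    return group.loc[idx, label_col]

agg = df.groupby("qa_num").apply(
    longest_answer_label, label_col="finBERT Topic Classification"
)
print(agg.loc[3])  # the label of the longer answer, "Earnings"
```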

Phi-3.5 performance was slightly worsened by summarisation, which likely reflects its primary use as a generalist language model that performs better with larger context.

Therefore: finBERT(summarised) with answer aggregation is preferable to Phi-3.5 for further classification analysis.

Sentiment

The opposite effect (decreased performance) was seen for answer-aggregated sentiment analysis, where, if multiple answers were present, the sentiment of the longest answer was taken as the sentiment of the whole answer. RoBERTa accuracy decreased from 66.7% to 61.9%.

Therefore: it is preferable to use RoBERTa without answer aggregation; but for Negative Evasion analysis (which requires aggregated answers), RoBERTa is still preferable to Phi-3.5.

Question-Evasion

Phi-3.5 evasion detection can only be performed on the aggregated Q&A set, where it outperforms all other models with 62% accuracy.

Accuracy of evasion analysis was much improved by running the evasion prompt separately from the sentiment prompt. However, this makes analysing full transcript sets of Q&As unviable: running the analysis once per prompt would negate the proposed time/resource-saving benefits of using Phi as a replacement for other analysis methods.

As Phi is the most accurate question-evasion method tested, we will apply it to analyse Q&As from transcripts covering specific quarters of interest (identified in the EDA section).
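The time/resource trade-off can be quantified with a back-of-envelope calculation using the timings reported later in this notebook (8.75 minutes for 73 aggregated Q&As). The full-transcript Q&A count below is an assumption for illustration only:

```python
# Back-of-envelope scaling of Phi-3.5 inference cost
minutes_for_73 = 8.75          # observed run time for 73 aggregated Q&As
per_qa_minutes = minutes_for_73 / 73

total_qa_assumed = 375         # hypothetical count across all transcripts
hours_one_prompt = total_qa_assumed * per_qa_minutes / 60

# Running evasion and sentiment as separate prompts doubles the cost
print(f"~{hours_one_prompt:.2f} h per prompt, ~{2 * hours_one_prompt:.2f} h for both")
```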

3. AnalysisΒΆ

3.0 Running model on the full dataset to get responsesΒΆ

3.0.1 RoBERTa sentimentΒΆ

InΒ [Β ]:
full_data = pd.read_csv(processed_data_folder + "/transcripts_tabular_JPMorgan_clean.csv")
full_data = full_data[full_data['qa_type'].isin(['Q', 'A'])]
roberta_model = ClassificationModel("financial-roberta-large", "soleimanian/financial-roberta-large-sentiment")
full_data = roberta_model.get_model_response_for_df(full_data, 'qa_text')
full_data.to_csv(output_data_folder + "/sentiment_full_result.csv", index=False)

3.0.2 Phi-3.5 by quarterΒΆ

InΒ [Β ]:
qa_df = pd.read_csv(processed_data_folder + "/transcripts_tabular_JPMorgan_clean.csv")
InΒ [Β ]:
"""
Interrogate QA pairs from specific quarters from pre-processed table for Sentiment and Question-Evasion Analysis (with associated topics - non-categorical)
This analysis was only performed on the aggregated dataset, in order to analyse Phi-3.5 identified Negative Evasive topics
    Note that a more comprehensive approach would analyse the quarters separately using RoBERTa, and then compare these sentiments to Phi-3.5-identified Evasion.
    However, for a simple exploration of the capabilities of Phi-3.5, a Phi-3.5-only approach was used.
"""

# Define quarters of interest
quarters_of_interest = ['1Q22', '2Q24', '3Q24']

# Filter the DataFrame for rows where the 'uid' column contains any of the quarters of interest
quarter_filtered_df = qa_df[qa_df['uid'].str.contains('|'.join(quarters_of_interest))]

# Aggregate df for Phi3.5 evasion analysis
agg_quarters_df = aggregate_QA_data(quarter_filtered_df)
Original DataFrame: 167 rows
uid bank year quarter date section name title firm qa_type qa_num_within qa_num qa_text
1 JPMorganChase_3Q24_Q_1.0 JPMorganChase 2024 3 2024-10-11 questions_answers Jim Mitchell Analyst Seaport Global Securities LLC Q 1.0 1.0 Hey, good morning. So, Jeremy, as you highligh...
2 JPMorganChase_3Q24_A_1.0 JPMorganChase 2024 3 2024-10-11 questions_answers Jeremy Barnum Chief Financial Officer JPMorgan Chase & Co. A 1.0 1.0 Yeah. Sure, Jim. I'll try to answer both quest...
3 JPMorganChase_3Q24_Q_3.0 JPMorganChase 2024 3 2024-10-11 questions_answers Steven Chubak Analyst Wolfe Research LLC Q 3.0 3.0 Hi. Good morning. So Jeremy, how are you? So I...
4 JPMorganChase_3Q24_A_3.0 JPMorganChase 2024 3 2024-10-11 questions_answers Jeremy Barnum Chief Financial Officer JPMorgan Chase & Co. A 3.0 3.0 Sure. So good question and I agree with your n...
5 JPMorganChase_3Q24_A_3.1 JPMorganChase 2024 3 2024-10-11 questions_answers Jamie Dimon Chairman & Chief Executive Officer JPMorgan Chase & Co. A 3.0 3.0 And can I just give you just a view of expense...
Aggregated QA DataFrame: 73 rows
qa_type qa_num qa_text
uid
JPMorganChase_3Q24_QA_1.0 QA 1.0 Hey, good morning. So, Jeremy, as you highligh...
JPMorganChase_3Q24_QA_3.0 QA 3.0 Hi. Good morning. So Jeremy, how are you? So I...
JPMorganChase_3Q24_QA_4.0 QA 4.0 Thank you both for the color. Just a quick fol...
JPMorganChase_3Q24_QA_6.0 QA 6.0 My first question, and thank you very much for...
JPMorganChase_3Q24_QA_7.0 QA 7.0 And if I can ask my second question, and, Jami...
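The aggregation step above can be sketched as follows. This is a minimal illustration of the idea behind `aggregate_QA_data` (the actual helper may differ in detail): each question and all of its answers are concatenated into a single 'QA' row per `qa_num`.

```python
import pandas as pd

# Toy example: one question with two answers
df = pd.DataFrame({
    "qa_type": ["Q", "A", "A"],
    "qa_num": [3.0, 3.0, 3.0],
    "qa_text": [
        "Hi. Good morning. How is the expense outlook?",
        "Sure. So good question.",
        "And can I just give you a view of expenses.",
    ],
})

# Concatenate question and answer text in order, one row per qa_num
agg = df.groupby("qa_num", as_index=False).agg(qa_text=("qa_text", " ".join))
agg["qa_type"] = "QA"
print(agg)
```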
InΒ [Β ]:
# Run Phi-3.5 function for Evasion and Sentiment analysis (both create non-categorical topics for each sentiment and evasion)
print(f'Running Evasion analysis on {quarters_of_interest}')
quarter_evasion_df = phi_question_answer(agg_quarters_df, evasion_questions, input_aggregated='Y', input_col='qa_text')
print(f'Running Sentiment analysis on {quarters_of_interest}')
quarter_sentiment_df = phi_question_answer(agg_quarters_df, sentiment_questions, input_aggregated='Y', input_col='qa_text')
Running Evasion analysis on ['1Q22', '2Q24', '3Q24']
input_aggregated: Y
1/73 ...
5/73 ...
Keywords found: Basel III
10/73 ...
15/73 ...
20/73 ...
Keywords found: PRA, Basel III
25/73 ...
30/73 ...
Keywords found: Basel III
35/73 ...
Keywords found: Basel III
Keywords found: Basel III
40/73 ...
45/73 ...
50/73 ...
55/73 ...
60/73 ...
65/73 ...
70/73 ...
qa_type qa_num qa_text keywords Phi-3.5 Evasion Present Phi-3.5 Evasion Degree Phi-3.5 Evaded Topics
uid
JPMorganChase_3Q24_QA_1.0 QA 1.0 Hey, good morning. So, Jeremy, as you highligh... None Evasive Moderate Yield Curve, Deposit Behavior, Pricing, NII Tr...
JPMorganChase_3Q24_QA_3.0 QA 3.0 Hi. Good morning. So Jeremy, how are you? So I... None Evasive Moderate expense forecast, core expense base, annualiza...
JPMorganChase_3Q24_QA_4.0 QA 4.0 Thank you both for the color. Just a quick fol... None Not Evasive None N/A
JPMorganChase_3Q24_QA_6.0 QA 6.0 My first question, and thank you very much for... None Evasive Moderate 2025 consensus, NII Markets, Yield Curve, Empi...
JPMorganChase_3Q24_QA_7.0 QA 7.0 And if I can ask my second question, and, Jami... None Evasive Moderate Capital Deployment, Direct Lending, Innovation...
Time taken for 73 QA: 8.75 minutes
Estimate for all transcripts: 0.75 hours
Running Sentiment analysis on ['1Q22', '2Q24', '3Q24']
input_aggregated: Y
1/73 ...
5/73 ...
Keywords found: Basel III
10/73 ...
15/73 ...
20/73 ...
Keywords found: PRA, Basel III
25/73 ...
30/73 ...
Keywords found: Basel III
35/73 ...
Keywords found: Basel III
Keywords found: Basel III
40/73 ...
45/73 ...
50/73 ...
55/73 ...
60/73 ...
65/73 ...
70/73 ...
qa_type qa_num qa_text keywords Phi-3.5 Sentiment Phi-3.5 Positive % Phi-3.5 Negative % Phi-3.5 Neutral % Phi-3.5 Positive Topics Phi-3.5 Negative Topics Phi-3.5 Neutral Topics
uid
JPMorganChase_3Q24_QA_1.0 QA 1.0 Hey, good morning. So, Jeremy, as you highligh... None Negative 20.0 60.0 20.0 NII growth, sequential increase, trough predic... sequential decline, yield curve, deposit behavior Fed rate cuts, yield-seeking behavior, QT end
JPMorganChase_3Q24_QA_3.0 QA 3.0 Hi. Good morning. So Jeremy, how are you? So I... None Positive 70.0 20.0 10.0 Investment Growth, Strategy Execution, Cost Ef... Expense Forecast, Inflation, Annualization, Ex... Expense Guidance, Budget Cycle, Investor Day
JPMorganChase_3Q24_QA_4.0 QA 4.0 Thank you both for the color. Just a quick fol... None Positive 60.0 20.0 20.0 NII, Yield Expansion, Policy Rate, Duration Ma... Curve Inversion, Extreme Scenarios, Rate Cuts,... Excess Reserves, Fed, Securities, Forward Curv...
JPMorganChase_3Q24_QA_6.0 QA 6.0 My first question, and thank you very much for... None Negative 30.0 40.0 30.0 NII, Consensus, Growth Resume Yield Curve, Fed Cuts, EaR Adjustment Model, Empirical, Deposit Betas
JPMorganChase_3Q24_QA_7.0 QA 7.0 And if I can ask my second question, and, Jami... None Positive 60.0 20.0 20.0 capital generation, return on equity, sharehol... market volatility, asset inflation, cautious a... quarterly earnings, capital deployment, shareh...
Time taken for 73 QA: 8.56 minutes
Estimate for all transcripts: 0.73 hours
InΒ [Β ]:
# Merge results back into single dataframe
quarter_sentiment_df = quarter_sentiment_df.drop(columns=['qa_type', 'qa_num', 'qa_text','keywords'])
quarter_df_results = pd.merge(quarter_evasion_df, quarter_sentiment_df, left_index=True, right_index=True, how="outer")

# Extract quarter and year information and add to dataframe
quarter_df_results['quarter_ID'] = quarter_df_results.index.to_series().str.extract(r'_(\dQ\d{2})_')
quarter_df_results['year'] = quarter_df_results['quarter_ID'].str[-2:].apply(lambda x: '20' + x)

display(quarter_df_results.head())
qa_type qa_num qa_text keywords Phi-3.5 Evasion Present Phi-3.5 Evasion Degree Phi-3.5 Evaded Topics Phi-3.5 Sentiment Phi-3.5 Positive % Phi-3.5 Negative % Phi-3.5 Neutral % Phi-3.5 Positive Topics Phi-3.5 Negative Topics Phi-3.5 Neutral Topics quarter_ID year
uid
JPMorganChase_1Q22_QA_1.0 QA 389 Thank you. Morning, Jeremy. Was wondering abou... NaN Evasive Moderate Net Interest Income Expectations, Fed Hike Imp... Positive 60 40 0 Net Interest Income, Excess, Increase, Investo... Rate Expectations, Fed Hike, Front End of the ... NaN 1Q22 2022
JPMorganChase_1Q22_QA_10.0 QA 398 Okay. And then just the follow up here is on t... NaN Evasive Moderate NII Outlook, Reinvestment Strategy, Investment... Positive 70 10 20 NII outlook, Investor Day, Confidence, 17% ROT... Rate Environment, Short-term Headwinds, Fed's ... Reinvestment, Historical Practice, Future Spen... 1Q22 2022
JPMorganChase_1Q22_QA_12.0 QA 400 Hey. Good morning. So wanted to start off with... NaN Evasive Moderate Fed balance sheet reduction, Deposit outflow e... Neutral 10 40 50 Fed policy, Deposit outflow, Industry perspect... Fed balance sheet reduction, Deposit outflow r... Financial context, Question, Fed outlining, Ex... 1Q22 2022
JPMorganChase_1Q22_QA_13.0 QA 401 Hey, Steve. So this is a fun question. So let'... NaN Evasive Moderate company's strategy, impact of QT, effect on ba... Positive 60 20 20 NII growth, robust loan growth, share wins, pi... QT headwind, bill maturity decisions, short-da... QE cycle, RRP dynamics, Fed minutes, system-wi... 1Q22 2022
JPMorganChase_1Q22_QA_14.0 QA 402 No, that's really helpful color. Thanks for al... NaN Not Evasive NaN NaN Positive 60 30 10 Fed hikes, Market Underestimation, Excess Liqu... Inflation Concerns, Market Volatility, QT Impa... Shareholder Letter, Investment Portfolio, Cust... 1Q22 2022
InΒ [Β ]:
quarter_df_results.to_csv(output_data_folder + "/phi_quarter_results.csv")

3.1 22Q1ΒΆ

We run a small number of analyses on 22Q1 as a proof of concept. More detailed analysis is performed on 24Q2 and 24Q3 to identify emerging risks.

3.1.1 SentimentΒΆ

InΒ [Β ]:
full_result = pd.read_csv(output_data_folder + "/sentiment_full_result.csv")
full_result['date'] = pd.to_datetime(full_result['date'])
InΒ [Β ]:
score_by_quarter = full_result.groupby('date')['financial-roberta-large_score'].mean()
year_quarter = full_result[['date', 'year', 'quarter']].drop_duplicates()
score_by_quarter = pd.merge(score_by_quarter, year_quarter, on='date', how='left')
score_by_quarter['year_quarter'] = score_by_quarter['quarter'].astype(str) + 'Q' + (score_by_quarter['year']-2000).astype(str)
score_by_quarter.set_index('year_quarter', inplace=True)

fig, ax1 = plt.subplots(figsize=(10,3))

sns.lineplot(x=score_by_quarter.index, y=score_by_quarter['financial-roberta-large_score'], ax=ax1,
              color='orange', lw=2, label='average sentiment', legend=False)
xticks = [f"{row['quarter']}Q{row['year']-2000}" for _, row in score_by_quarter.iterrows()]
plt.xticks(ticks=score_by_quarter.index, labels=xticks, rotation=25)
ax1.set_xlabel('')
ax1.set_ylabel('Average sentiment score',  fontsize=14)
ax1.set_ylim(-0.3, 0.3)
for label in ax1.get_yticklabels():
  label.set_size(fontsize=14)
plt.axhline(y=0, lw=0.8, ls=(0, (5,10)), color='black')
ax1.axvspan(2,4, facecolor='orange', alpha=0.2, edgecolor=None)
ax1.set_title("Average sentiment score over time")
for label in ax1.get_xticklabels():
  label.set_size(fontsize=12)

fig.legend(fontsize=14, loc='center left', bbox_to_anchor=(0.9, 0.5))
No description has been provided for this image
InΒ [Β ]:
quarter_count = full_result.groupby(['year', 'quarter'])['financial-roberta-large_sentiment'].value_counts(normalize=True).reset_index()
quarter_count = quarter_count.pivot(index=['year', 'quarter'], columns='financial-roberta-large_sentiment', values='proportion').reset_index()

fig, ax1 = plt.subplots(figsize=(10,3))
sns.lineplot(x=quarter_count['quarter'].astype(str)+'Q'+(quarter_count['year']-2000).astype(str),
             y=quarter_count['negative'],
             label='negative',lw=2,
             color='lightblue',
             legend=False)
sns.lineplot(x=quarter_count['quarter'].astype(str)+'Q'+(quarter_count['year']-2000).astype(str),
             y=quarter_count['positive'],
             label='positive',
             color='lightcoral',
             ls='dashed', lw=2,
             legend=False)
sns.lineplot(x=quarter_count['quarter'].astype(str)+'Q'+(quarter_count['year']-2000).astype(str),
             y=quarter_count['neutral'],
             label='neutral',
             color='grey',lw=2,
             ls='dashdot',
             legend=False)
plt.xticks(rotation=25)
plt.xlabel('')
plt.ylabel('Proportion of sentiment', fontsize=14)
for label in ax1.get_yticklabels():
  label.set_size(fontsize=14)
ax1.axvspan(2,4, facecolor='orange', alpha=0.2, edgecolor=None)

for label in ax1.get_xticklabels():
  label.set_size(fontsize=12)

fig.legend(fontsize=14, loc='center left', bbox_to_anchor=(0.9, 0.5))
No description has been provided for this image

3.1.2 Topic modelling on negative-sentiment textsΒΆ

3.1.2.1 BERTopicΒΆ

InΒ [Β ]:
# Read the sentiment and summarised data
sentiment_df = pd.read_csv(output_data_folder + "/sentiment_full_result.csv")
df_phi_fulltable_summarised = pd.read_excel(processed_data_folder + '/phi_fulltable_summarised.xlsx')

# subset relevant columns from sentiment file
sentiment_colname = 'financial-roberta-large_sentiment'
sentiment_df_22Q1= sentiment_df[(sentiment_df.year == 2022) & (sentiment_df.quarter == 1)]
sentiment_df_22Q1 = sentiment_df_22Q1[['uid', sentiment_colname]]

# Subset relevant columns from summarised text file
df_phi_fulltable_summarised_2022Q1 = df_phi_fulltable_summarised[(df_phi_fulltable_summarised.year == 2022) & 
                                                                 (df_phi_fulltable_summarised.quarter == 1)]
summarised_22Q1_df = df_phi_fulltable_summarised_2022Q1[['uid', 'summarised_text']]

# Further filter sentiment to only include negative sentiments
sentiment_df_22Q1 = sentiment_df_22Q1[sentiment_df_22Q1[sentiment_colname] == 'negative'][['uid', sentiment_colname]]

# Merge negative-sentiment data with the summarised text data
df_sentiment_22Q1_summarised = pd.merge(sentiment_df_22Q1,summarised_22Q1_df , on=['uid'], how='inner')
InΒ [Β ]:
df_phi_2022Q1_neg_list = df_sentiment_22Q1_summarised['summarised_text'].apply(preprocess_spacy).to_list()
dim_model = PCA(n_components=11)
cluster_model = KMeans(n_clusters=11)

topic_model_phi_2022Q1_neg = BERTopic(umap_model=dim_model, embedding_model=embedding_model,
                       hdbscan_model=cluster_model, calculate_probabilities=True)
topics, probabilities = topic_model_phi_2022Q1_neg.fit_transform(df_phi_2022Q1_neg_list)
# Reduce topics with higher diversity
topic_model_phi_2022Q1_neg = topic_model_phi_2022Q1_neg.reduce_topics(df_phi_2022Q1_neg_list, nr_topics=4)
InΒ [Β ]:
# Plot each BERTtopic visualization into a subplot
topic_model_phi_2022Q1_neg.visualize_topics().write_html("topic_model_phi_2022Q1_neg_topic.html")
topic_model_phi_2022Q1_neg.visualize_barchart(n_words=10, autoscale=True).write_html("topic_model_phi_2022Q1_neg_topic_barchart.html")
topic_model_phi_2022Q1_neg.visualize_heatmap().write_html("topic_model_phi_2022Q1_neg_topic_heatmap.html")
topic_model_phi_2022Q1_neg.visualize_hierarchy().write_html("topic_model_phi_2022Q1_neg_topic_hierarchy.html")

# Load each plot into a subplot

#display(HTML("topic_model_phi_2022Q1_neg_topic.html"))
display(HTML("topic_model_phi_2022Q1_neg_topic_barchart.html"))
display(HTML("topic_model_phi_2022Q1_neg_topic_heatmap.html"))
display(HTML("topic_model_phi_2022Q1_neg_topic_hierarchy.html"))

plt.tight_layout()
plt.show()

image.png

3.1.2.2 FinBERTΒΆ

3.1.2.2.0 FunctionsΒΆ
InΒ [Β ]:
# Function to standardise formatting for the particular analysts: by omitting middle names and using full names.
def standardise_analyst_names(df, bank):
    analyst_dict = {}  # default to no replacements for unrecognised banks
    if bank=='JPMorgan':
        analyst_dict = {
            "Charles W. Peabody": "Charles Peabody",
            "Ebrahim H. Poonawala": "Ebrahim Poonawala",
            "Jim Mitchell": "James Mitchell",
            "John E. McDonald": "John McDonald",
            "Kenneth M. Usdin": "Kenneth Usdin",
            "Ken Usdin": "Kenneth Usdin",
            "Matt O’Connor": "Matt O'Connor"
        }
    return df['name'].replace(analyst_dict)

# Plotting the topic probability distributions for a specific condition (e.g. analyst name, quarter, etc.)
def plot_topics_by_condition(condition_dict, finbert_folder, data_folder, label_dict,
                               chunking=True, max_length=512, bank='JPMorgan',
                              datatype='all', appdx="", summarised=False,
                             synthetic=False, sentiment=None, sentiment_folder=None,
                             save=False):
    """
    condition_dict should be a dictionary of len=1
    the key should be a column name, the value should be a value in that column
    """

    df_merged, _, metadata = get_merged_data_for_plotting(finbert_folder, data_folder,
                                             label_dict, chunking=chunking, appdx=appdx,
                                             max_length=max_length, bank=bank, datatype=datatype,
                                            summarised=summarised, synthetic=synthetic,
                                            sentiment=sentiment, sentiment_folder=sentiment_folder)


    key = list(condition_dict.keys())[0]
    # standardise analyst names (see 5.1 below - adding it here to reuse this function later)
    if key=='name':
        df_merged['name'] = standardise_analyst_names(df_merged, bank)
    if type(condition_dict[key]) != list:
        condition_dict[key] = [condition_dict[key]]
    df_qrt = df_merged[df_merged[key].isin(condition_dict[key])].copy()

    topic_labels = list(label_dict.keys())
    cmap = sns.color_palette("tab20")

    prob_cols = [f"topic_{i}_prob" for i in label_dict.keys()]

    # order topic labels by their prevalence
    if sentiment is None or sentiment=='negative':
        topic_labels = df_qrt[prob_cols].median(axis=0).sort_values(ascending=False).index
    else:   # have the same order as in negative sentiment texts
        df_merged_neg, _, _ = get_merged_data_for_plotting(finbert_folder, data_folder,
                                             label_dict, chunking=chunking, appdx=appdx,
                                             max_length=max_length, bank=bank, datatype=datatype,
                                            summarised=summarised, synthetic=synthetic,
                                            sentiment='negative', sentiment_folder=sentiment_folder)
        if key=='name':
            df_merged_neg['name'] = standardise_analyst_names(df_merged_neg, bank)
        df_qrt_neg = df_merged_neg[df_merged_neg[key].isin(condition_dict[key])].copy()
        topic_labels = df_qrt_neg[prob_cols].median(axis=0).sort_values(ascending=False).index

    fig, ax = plt.subplots(1,1, figsize=(5,5), sharex=True)
    ax.vlines(0.05, 0, len(topic_labels), ls='dashed', color='grey', lw=0.5)
    for num, topic_label in enumerate(topic_labels):
        i = int(topic_label.split("_")[1])

        # subset the df to keep only one topic
        cols_to_keep = ["finbert_topic_id", topic_label]
        df_topic = df_qrt[cols_to_keep].copy()

        # plot
        sns.boxplot(x=df_topic[topic_label],
                    y=[num]*df_topic.shape[0],
                    orient='h',
                    showfliers=True,
                    color=cmap[i],
                    flierprops=dict(marker='o', markerfacecolor=cmap[i], markersize=6, alpha=0.7)
                    )
    fsize=14
    ax.set_yticks(np.arange(len(topic_labels)),
                  labels=[label_dict[int(i.split("_")[1])] for i in topic_labels],
                  fontsize=fsize)
    ax.set_xlabel("Topic probability", fontsize=fsize)
    ax.tick_params(axis='x', labelsize=fsize)
    # axes2.set_yticklabels(topics, rotation=0, fontsize=fsize)
    if save:
        fig.savefig(f"topic_boxplots_{key}_{condition_dict.values()}.png",
                        dpi=300, bbox_inches="tight")
3.1.2.2.1 Plotting probability distributionsΒΆ

We will start by plotting the topic probability distribution of negative texts in the quarter of interest.

InΒ [Β ]:
plot_topics_by_condition(condition_dict={"quarter_str": ["1Q22"]},
                          finbert_folder=output_data_folder,
                          data_folder=output_data_folder,
                          label_dict=id2label,
                          max_length=512,
                          chunking=False,
                          bank='JPMorgan',
                          datatype='QA',
                         sentiment='negative',
                         appdx='_summarised',
                         sentiment_folder=output_data_folder,
                         save=True)
No description has been provided for this image

Topics with the highest probabilities in negative-sentiment texts are Macro, General News | Opinion, and Fed | Central Banks. However, these are fairly frequently assigned in general and might not reflect an effect specific to negative sentiments.
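One way to check whether a topic is specific to negative texts, rather than simply frequent overall, is to compare its median probability in negative texts against its median across all texts. The numbers below are illustrative stand-ins, not values from the notebook's data:

```python
import pandas as pd

# Hypothetical per-text topic probabilities (columns mirror the finBERT output)
df = pd.DataFrame({
    "sentiment": ["negative", "negative", "positive", "neutral"],
    "topic_0_prob": [0.60, 0.50, 0.55, 0.50],  # high everywhere (e.g. Macro)
    "topic_1_prob": [0.40, 0.45, 0.10, 0.15],  # elevated mainly when negative
})

prob_cols = ["topic_0_prob", "topic_1_prob"]
# Positive lift = topic is more prevalent in negative texts than overall
lift = (df.loc[df["sentiment"] == "negative", prob_cols].median()
        - df[prob_cols].median())
print(lift)
```

Here `topic_1` shows the larger lift, i.e. it is more negativity-specific, even though `topic_0` has the higher raw probability.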

3.1.3 EvasionΒΆ

InΒ [Β ]:
# Filter Dataframe for quarter 1Q22
filtered_df_1Q22 = quarter_df_results[(quarter_df_results['quarter_ID'] == '1Q22')]
InΒ [Β ]:
# Plot a pie chart
def plot_pie_chart(data, title):
  results_series = data.squeeze()
  results_counts = results_series.value_counts()

  plt.figure(figsize=(5, 5))
  plt.pie(results_counts, labels=results_counts.index, autopct='%1.1f%%', startangle=90, colors=['lightcoral', 'lightblue'])
  plt.title(f"Evasive vs Not-Evasive answers in {title}")
  plt.show()
InΒ [Β ]:
# Plot a pie chart
plot_pie_chart(filtered_df_1Q22['Phi-3.5 Evasion Present'], '1Q22')
No description has been provided for this image

3.1.3.1 Topic modelling on evasive textsΒΆ

3.1.3.1.1 BERTopicΒΆ
InΒ [Β ]:
def extract_keywords(model):
    topic_words = set()
    for topic_num in range(len(model.get_topics())):
        # Extract the words for each topic
        words = [word for word, _ in model.get_topic(topic_num)]
        topic_words.update(words)
        # Print topic number, words, and the number of words
        print(f"Topic {topic_num}: {words} (Count: {len(words)})")
    return topic_words
InΒ [Β ]:
# Subset relevant columns from the combined evasion file
qevasion_colname = 'Phi-3.5 Evasion Present'
qevasion_df = filtered_df_1Q22[['qa_num', qevasion_colname]]

# Subset relevant columns from the summarised text file
fulltable_summarised_df = df_phi_fulltable_summarised[['uid', 'qa_num', 'summarised_text']]

# Merge the two datasets and keep only evasive Q&As
qevasion_status='Evasive'
df_qevasion_summarised = pd.merge(fulltable_summarised_df, qevasion_df, on=['qa_num'], how='left')
df_qevasion_summarised['quarter_str'] = [x.split("_")[1] for x in df_qevasion_summarised['uid']]
df_evaded_summarised = df_qevasion_summarised[df_qevasion_summarised[qevasion_colname].eq(qevasion_status)].copy().reset_index(drop=True)
InΒ [Β ]:
quarter_str_status_22 = '1Q22'
quarter_colname = 'quarter_str'
df_evaded_summarised_22Q1 = df_evaded_summarised[df_evaded_summarised[quarter_colname].eq(quarter_str_status_22)].copy().reset_index(drop=True)
df_evaded_summarised_22Q1_list = df_evaded_summarised_22Q1['summarised_text'].apply(preprocess_spacy).to_list()
InΒ [Β ]:
dim_model = PCA(n_components=5)
cluster_model = KMeans(n_clusters=20)

topic_evaded_summarised_22Q1 = BERTopic(umap_model=dim_model, embedding_model=embedding_model,
                       hdbscan_model=cluster_model, calculate_probabilities=True)
topics, probs = topic_evaded_summarised_22Q1.fit_transform(df_evaded_summarised_22Q1_list)
topic_evaded_summarised_22Q1 = topic_evaded_summarised_22Q1.reduce_topics(df_evaded_summarised_22Q1_list, nr_topics=5)
InΒ [Β ]:
# Plot each BERTtopic visualization into a subplot
topic_evaded_summarised_22Q1.visualize_topics().write_html("topic_evaded_summarised_22Q1_topic.html")
topic_evaded_summarised_22Q1.visualize_barchart(top_n_topics=20,n_words=8, autoscale=True).write_html("topic_evaded_summarised_22Q1_barchart.html")
topic_evaded_summarised_22Q1.visualize_heatmap().write_html("topic_evaded_summarised_22Q1_heatmap.html")
topic_evaded_summarised_22Q1.visualize_hierarchy().write_html("topic_evaded_summarised_22Q1_hierarchy.html")

# Load each plot into a subplot

#display(HTML("topic_evaded_summarised_22Q1_topic.html"))
display(HTML("topic_evaded_summarised_22Q1_barchart.html"))
display(HTML("topic_evaded_summarised_22Q1_heatmap.html"))
display(HTML("topic_evaded_summarised_22Q1_hierarchy.html"))

plt.tight_layout()
plt.show()

image.png

3.1.4 Evasion + NegativityΒΆ

InΒ [Β ]:
# Extract keywords from each model
keywords_evaded_22Q1 = extract_keywords(topic_evaded_summarised_22Q1)
keywords_phi_22Q1neg = extract_keywords(topic_model_phi_2022Q1_neg)
InΒ [Β ]:
# Create a Venn diagram
venn_22 = venn2(
    [keywords_evaded_22Q1, keywords_phi_22Q1neg],
    ('Evaded', 'Negative')
)
# Set colors
venn_22.get_patch_by_id('10').set_color('#FFBF00')  # Evaded only
venn_22.get_patch_by_id('01').set_color('#00BFFF')    # Negative only
# Add a title
plt.title("BERTopic Word Frequencies")
plt.show()
No description has been provided for this image
InΒ [Β ]:
# Inspect overlaps
intersection_evaded_phi_22Q1neg = keywords_evaded_22Q1 & keywords_phi_22Q1neg
print("Overlap between Evaded and Phi Negative:", intersection_evaded_phi_22Q1neg)
Overlap between Evaded and Phi Negative: {'inflation', 'financial', 'price', 'capital', 'excessive', 'ukraine', 'performance', 'criticizes', 'growth', 'war', 'indicating', 'discusses', 'detracts', 'discussion', 'viewing'}
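The raw intersection above can be summarised with a single number via Jaccard similarity (intersection size over union size). The sets below are small toy stand-ins for the extracted keyword sets:

```python
# Toy keyword sets; the real sets come from extract_keywords() above
keywords_evaded = {"inflation", "capital", "growth", "war", "qt"}
keywords_negative = {"inflation", "capital", "growth", "rates"}

intersection = keywords_evaded & keywords_negative
union = keywords_evaded | keywords_negative
jaccard = len(intersection) / len(union)
print(f"Jaccard similarity: {jaccard:.2f}")  # 3 shared / 6 total = 0.50
```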

3.2 Recent two quartersΒΆ

By looking at the two most recent quarters, we can identify any emerging risks.

3.2.0 Initial data explorationΒΆ

InΒ [Β ]:
# Getting the unique metric types
unique_metric_types = metrics_df['metric_type'].unique()

# Defining colors for each plot
colors = ['blue', 'green', 'red', 'purple']

# Plotting in a 2x2 grid
fig, axes = plt.subplots(2, 2, figsize=(18, 9))

# Flattening axes array for easy iteration
axes = axes.flatten()

# Looping through metric types and corresponding axes
for i, metric in enumerate(unique_metric_types):
    subset = metrics_df[metrics_df['metric_type'] == metric]
    axes[i].plot(subset['Q&FY'], subset['metric_value'], marker='o', color=colors[i])
    axes[i].set_title(f'Metric: {metric}', fontsize=16, color=colors[i])
    axes[i].set_xlabel('Quarter', fontsize=14)
    axes[i].set_ylabel('Metric Value', fontsize=14)
    axes[i].tick_params(axis='x', labelsize=12, rotation=45)
    axes[i].tick_params(axis='y', labelsize=12)
    axes[i].grid(visible=True, linestyle='--', alpha=0.5)

# Adjusting the layout
plt.tight_layout()

# Showing the plots
plt.show()
No description has been provided for this image

Over the last two quarters, the bank has maintained a high CET1 ratio, indicating a strong capital position. However, net income and EPS spiked in 2Q24 before dropping in 3Q24, signaling earnings volatility that could impact profitability. Additionally, elevated provisions for credit losses suggest caution around potential credit risks. Overall, while capital strength is reassuring, addressing credit risk and earnings instability will be crucial for sustaining financial health and investor confidence.

InΒ [Β ]:
# Creating a dataframe for the 2Q24 and 3Q24 Q&As
Q23_FY24 = transcripts_df[(transcripts_df['date'] >= '2024-07-01') & (transcripts_df['date'] < '2024-11-01')]

# Pulling all words from qa_text_processed column into a list
Q23_FY24_all_words = [word for tokens in Q23_FY24['qa_text_processed'] for word in tokens]

# Calculating the frequency distribution of the words from the dataset
Q23_FY24_freq_dist = FreqDist(Q23_FY24_all_words)

# Getting the top 10 words and their frequencies from the 2Q24 and 3Q24 Q&As
Q23_FY24_top_10 = Q23_FY24_freq_dist.most_common(10)
Q23_FY24_words, Q23_FY24_counts = zip(*Q23_FY24_top_10)
InΒ [Β ]:
# Setting up the plot
fig, ax1 = plt.subplots(figsize=(14, 6))

# Plotting the top 10 words in a barplot for the 2Q24 and 3Q24 Q&As
sns.barplot(x=list(Q23_FY24_words), y=list(Q23_FY24_counts), palette="viridis", ax=ax1)
ax1.set_title('Top 10 Words in the 2Q24 and 3Q24 Q&As')
ax1.set_xlabel('Words')
ax1.set_ylabel('Frequency')
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=45)

# Adjusting the layout
plt.tight_layout()

# Showing the plots
plt.show()


InΒ [Β ]:
# Creating a dataframe excluding 2Q24 and 3Q24 Q&As
not_Q23_FY24 = transcripts_df[(transcripts_df['date'] < '2024-07-01') | (transcripts_df['date'] >= '2024-11-01')]

# Getting unique words for the 2Q24 and 3Q24 Q&As
Q23_FY24_unique_words = set(Q23_FY24_freq_dist.keys()) - set(not_Q23_FY24['qa_text_processed'].explode())
Q23_FY24_unique_counts = {word: Q23_FY24_freq_dist[word] for word in Q23_FY24_unique_words}
Q23_FY24_unique_df = pd.DataFrame(Q23_FY24_unique_counts.items(), columns=['word', 'Q23_FY24_frequency'])

# Getting the top 10 unique words for the 2Q24 and 3Q24 Q&As
Q23_FY24_top_words = Q23_FY24_unique_df.nlargest(10, 'Q23_FY24_frequency')
InΒ [Β ]:
# Setting up the plot
fig, ax1 = plt.subplots(figsize=(14, 6))

# Creating a bar plot for the top 10 2Q24 and 3Q24 exclusive words
sns.barplot(data=Q23_FY24_top_words, x='word', y='Q23_FY24_frequency', palette='Blues')
plt.title('Top 10 Unique Words Used in the 2Q24 and 3Q24 Q&As')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.xticks(rotation=45)

# Adjusting the layout
plt.tight_layout()

# Showing the plots
plt.show()


Words like β€œspike” and β€œtrough” could reflect the recent fluctuations in net income and EPS.

InΒ [Β ]:
# Calculating word frequencies and total word count in 2Q24 and 3Q24 Q&As
Q23_FY24_word_counts = Q23_FY24['qa_text_processed'].explode().value_counts()
Q23_FY24_total_words = Q23_FY24_word_counts.sum()  # Total word count for 2Q24 & 3Q24

# Calculating word frequencies and total word count in other quarters
not_Q23_FY24_word_counts = not_Q23_FY24['qa_text_processed'].explode().value_counts()
not_Q23_FY24_total_words = not_Q23_FY24_word_counts.sum()  # Total word count for other quarters

# Creating a DataFrame comparing relative frequencies
word_comparison_df = pd.DataFrame({
    'Q23_FY24_proportion': Q23_FY24_word_counts / Q23_FY24_total_words,
    'other_quarters_proportion': not_Q23_FY24_word_counts / not_Q23_FY24_total_words
}).fillna(0)

# Adding a column for proportion difference
word_comparison_df['proportion_difference'] = word_comparison_df['Q23_FY24_proportion'] - word_comparison_df['other_quarters_proportion']

# Filtering for words that are relatively more frequent in 2Q24 and 3Q24
higher_in_Q23_FY24 = word_comparison_df[word_comparison_df['proportion_difference'] > 0]

# Selecting the top 10 words with the highest proportion difference
top_higher_words = higher_in_Q23_FY24.nlargest(10, 'proportion_difference').reset_index().rename(columns={'qa_text_processed': 'word'})
InΒ [Β ]:
# Plotting the top words by proportional difference
fig, ax = plt.subplots(figsize=(14, 6))
sns.barplot(data=top_higher_words, x='word', y='proportion_difference', palette='Purples')
plt.title('Top 10 Words with Higher Proportion in 2Q24 and 3Q24 Compared to Other Quarters')
plt.xlabel('Words')
plt.ylabel('Proportional Difference')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


A higher usage of "capital" could relate to how the bank has maintained a high CET1 capital ratio, indicating a strong capital position.

A higher usage of "nii" (net interest income) could relate to the recent fluctuations in net income and EPS.
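The proportional-difference logic above can be illustrated end-to-end on toy token lists (the words are illustrative, not drawn from the transcripts): normalise counts by corpus size, subtract, and keep words that are relatively more frequent in the recent quarters.

```python
import pandas as pd

recent = ['capital', 'capital', 'nii', 'growth']   # stand-in for 2Q24/3Q24 tokens
other = ['growth', 'growth', 'growth', 'deposit']  # stand-in for other-quarter tokens

# Relative frequencies within each corpus
recent_prop = pd.Series(recent).value_counts() / len(recent)
other_prop = pd.Series(other).value_counts() / len(other)

# Align on the union of words and compute the proportion difference
diff = (pd.DataFrame({'recent': recent_prop, 'other': other_prop})
          .fillna(0)
          .assign(difference=lambda d: d['recent'] - d['other']))

# Words relatively more frequent in the recent quarters
higher = diff[diff['difference'] > 0].sort_values('difference', ascending=False)
```

Normalising by total word count matters because the recent-quarter corpus is much smaller than the rest of the transcript history.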

3.2.1 SentimentΒΆ

InΒ [Β ]:
full_result = pd.read_csv(output_data_folder + "/sentiment_full_result.csv")
full_result['date'] = pd.to_datetime(full_result['date'])
InΒ [Β ]:
score_by_quarter = full_result.groupby('date')['financial-roberta-large_score'].mean().reset_index()
year_quarter = full_result[['date', 'year', 'quarter']].drop_duplicates()
score_by_quarter = pd.merge(score_by_quarter, year_quarter, on='date', how='left')
score_by_quarter['year_quarter'] = score_by_quarter['quarter'].astype(str) + 'Q' + (score_by_quarter['year']-2000).astype(str)
score_by_quarter.set_index('year_quarter', inplace=True)

fig, ax1 = plt.subplots(figsize=(10,3))

sns.lineplot(x=score_by_quarter.index, y=score_by_quarter['financial-roberta-large_score'], ax=ax1,
              color='orange', lw=2, label='average sentiment', legend=False)
plt.xticks(rotation=25)
ax1.set_xlabel('')
ax1.set_ylabel('Average sentiment score',  fontsize=14)
ax1.set_ylim(-0.3, 0.3)
for label in ax1.get_yticklabels():
  label.set_size(fontsize=14)
plt.axhline(y=0, lw=0.8, ls=(0, (5,10)), color='black')
ax1.axvspan(12,13, facecolor='orange', alpha=0.2, edgecolor=None)
ax1.set_title("Average sentiment score over time")
for label in ax1.get_xticklabels():
  label.set_size(fontsize=12)

fig.legend(fontsize=14, loc='center left', bbox_to_anchor=(0.9, 0.5))
InΒ [Β ]:
quarter_count = full_result.groupby(['year', 'quarter'])['financial-roberta-large_sentiment'].value_counts(normalize=True).reset_index()
quarter_count = quarter_count.pivot(index=['year', 'quarter'], columns='financial-roberta-large_sentiment', values='proportion').reset_index()

fig, ax1 = plt.subplots(figsize=(10,3))
sns.lineplot(x=quarter_count['quarter'].astype(str)+'Q'+(quarter_count['year']-2000).astype(str),
             y=quarter_count['negative'],
             label='negative',lw=2,
             color='lightblue',
             legend=False)
sns.lineplot(x=quarter_count['quarter'].astype(str)+'Q'+(quarter_count['year']-2000).astype(str),
             y=quarter_count['positive'],
             label='positive',
             color='lightcoral',
             ls='dashed', lw=2,
             legend=False)
sns.lineplot(x=quarter_count['quarter'].astype(str)+'Q'+(quarter_count['year']-2000).astype(str),
             y=quarter_count['neutral'],
             label='neutral',
             color='grey',lw=2,
             ls='dashdot',
             legend=False)
plt.xticks(rotation=25)
plt.xlabel('')
plt.ylabel('Proportion of sentiment', fontsize=14)
for label in ax1.get_yticklabels():
  label.set_size(fontsize=14)
ax1.axvspan(12,13, facecolor='orange', alpha=0.2, edgecolor=None)

for label in ax1.get_xticklabels():
  label.set_size(fontsize=12)

fig.legend(fontsize=14, loc='center left', bbox_to_anchor=(0.9, 0.5))

Here we looked at the evolution of sentiment over time, with the two most recent quarters highlighted.

We can see that the average sentiment decreases. However, this is not because negative sentiment has increased, but because neutral sentiment has become more prevalent at the expense of positive sentiment.
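This decomposition can be checked arithmetically: mapping sentiments to +1/0/-1, the mean falls when positive share shifts to neutral even while the negative share stays flat. A toy sketch (the proportions are illustrative, not the actual data):

```python
score = {'positive': 1, 'neutral': 0, 'negative': -1}

def mean_sentiment(props):
    # props: {'positive': p, 'neutral': n, 'negative': q}, summing to 1
    return sum(score[k] * v for k, v in props.items())

earlier = {'positive': 0.30, 'neutral': 0.55, 'negative': 0.15}
recent = {'positive': 0.20, 'neutral': 0.65, 'negative': 0.15}  # positive -> neutral, negative flat

drop = mean_sentiment(earlier) - mean_sentiment(recent)  # positive: the mean has fallen
```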

InΒ [Β ]:
recent_result = full_result[(full_result['year']==2024) & full_result['quarter'].isin([3,2])]
recent_count = recent_result['financial-roberta-large_sentiment'].value_counts()
plt.pie(recent_count, labels=recent_count.index, autopct='%1.1f%%', startangle=90, colors=['grey', 'lightcoral', 'lightblue'])
plt.title('Question and answer sentiment from 2024Q2 and 2024Q3')
plt.show()

This is the sentiment split in the highlighted quarters: mostly neutral, with a fairly even split of positive and negative.

It is worth exploring the negative texts further, since we are interested in emerging risks.

3.2.2 Topic modelling on negative-sentiment textsΒΆ

3.2.2.1 BERTopicΒΆ

InΒ [Β ]:
sentiment_df = pd.read_csv(output_data_folder + "/sentiment_full_result.csv")
df_phi_fulltable_summarised = pd.read_excel(processed_data_folder + '/phi_fulltable_summarised.xlsx')
# subset relevant columns from sentiment file
sentiment_df_24Q2Q3= sentiment_df[(sentiment_df.year.isin([2024])) & (sentiment_df.quarter.isin([2, 3]))]
sentiment_colname = 'financial-roberta-large_sentiment'
sentiment_df_24Q2Q3 = sentiment_df_24Q2Q3[['uid', sentiment_colname]]
# subset relevant columns from the summarised text file
summarised_24Q2Q3_df = df_phi_fulltable_summarised[['uid', 'summarised_text']]
# keep only the negative-sentiment rows
sentiment_df_24Q2Q3 = sentiment_df_24Q2Q3[sentiment_df_24Q2Q3[sentiment_colname] == 'negative'][['uid', sentiment_colname]]
# merge the negative-sentiment rows with their summarised texts
df_sentiment_24Q2Q3_summarised = pd.merge(sentiment_df_24Q2Q3, summarised_24Q2Q3_df, on=['uid'], how='inner')
InΒ [Β ]:
df_phi_2024Q2Q3_neg_list = df_sentiment_24Q2Q3_summarised['summarised_text'].apply(preprocess_spacy).to_list()
dim_model = PCA(n_components=18)
cluster_model = KMeans(n_clusters=18)

topic_model_phi_2024Q2Q3_neg = BERTopic(umap_model=dim_model, embedding_model=embedding_model,
                       hdbscan_model=cluster_model, calculate_probabilities=True)
topics, probabilities = topic_model_phi_2024Q2Q3_neg.fit_transform(df_phi_2024Q2Q3_neg_list)
# Reduce the number of topics to 5
topic_model_phi_2024Q2Q3_neg = topic_model_phi_2024Q2Q3_neg.reduce_topics(df_phi_2024Q2Q3_neg_list, nr_topics=5)
InΒ [Β ]:
# Write each BERTopic visualisation to an HTML file
topic_model_phi_2024Q2Q3_neg.visualize_topics().write_html("topic_model_phi_2024Q2Q3_neg_topic.html")
topic_model_phi_2024Q2Q3_neg.visualize_barchart(n_words=8, autoscale=True).write_html("topic_model_phi_2024Q2Q3_neg_barchart.html")
topic_model_phi_2024Q2Q3_neg.visualize_heatmap().write_html("topic_model_phi_2024Q2Q3_neg_heatmap.html")
topic_model_phi_2024Q2Q3_neg.visualize_hierarchy().write_html("topic_model_phi_2024Q2Q3_neg_hierarchy.html")

# Display each saved plot in the notebook
#display(HTML("topic_model_phi_2024Q2Q3_neg_topic.html"))
display(HTML("topic_model_phi_2024Q2Q3_neg_barchart.html"))
display(HTML("topic_model_phi_2024Q2Q3_neg_heatmap.html"))
display(HTML("topic_model_phi_2024Q2Q3_neg_hierarchy.html"))


3.2.2.2 FinBERTΒΆ

Topic distribution in negative-sentiment texts.

InΒ [Β ]:
plot_topics_by_condition(condition_dict={"quarter_str": ["2Q24"]},
                          finbert_folder=output_folder,
                          data_folder=data_folder,
                          label_dict=id2label,
                          max_length=512,
                          chunking=False,
                          bank='JPMorgan',
                          datatype='QA',
                         sentiment='negative',
                         appdx='_summarised',
                         sentiment_folder=sentiment_folder)

Top three topics in negative texts:

  • Fed | Central Banks
  • Macro
  • Financials

3.2.3 EvasionΒΆ

InΒ [Β ]:
filtered_df_last2Q = quarter_df_results[(quarter_df_results['quarter_ID'] == '2Q24') | (quarter_df_results['quarter_ID'] == '3Q24')]
InΒ [Β ]:
# Plot a pie chart
plot_pie_chart(filtered_df_last2Q['Phi-3.5 Evasion Present'], '2Q24/3Q24')

Over 80% of answers were classed as evasive.

Note that Phi-3.5 tends to overestimate evasiveness in answers.

3.2.3.1 Topic modelling on evasive textsΒΆ

3.2.3.1.1 BERTopicΒΆ
InΒ [Β ]:
df_phi_fulltable_summarised = pd.read_excel(processed_data_folder + '/phi_fulltable_summarised.xlsx')
InΒ [Β ]:
# subset relevant columns from the evasion results
qevasion_colname = 'Phi-3.5 Evasion Present'
qevasion_df = filtered_df_last2Q[['qa_num', qevasion_colname]]

# subset relevant columns from the summarised text file
fulltable_summarised_df = df_phi_fulltable_summarised[['uid', 'qa_num', 'summarised_text']]

# merge the two datasets and keep only evasive answers
qevasion_status = 'Evasive'
df_qevasion_summarised = pd.merge(fulltable_summarised_df, qevasion_df, on=['qa_num'], how='left')
df_qevasion_summarised['quarter_str'] = [x.split("_")[1] for x in df_qevasion_summarised['uid']]
df_evaded_summarised = df_qevasion_summarised[df_qevasion_summarised[qevasion_colname].eq(qevasion_status)].copy().reset_index(drop=True)
InΒ [Β ]:
quarter_str_status_24 = ('2Q24', '3Q24')
quarter_colname = 'quarter_str'
df_evaded_summarised_24Q2Q3 = df_evaded_summarised[df_evaded_summarised[quarter_colname].isin(quarter_str_status_24)]
df_evaded_summarised_24Q2Q3_list = df_evaded_summarised_24Q2Q3['summarised_text'].apply(preprocess_spacy).to_list()
InΒ [Β ]:
dim_model = PCA(n_components=5)
cluster_model = KMeans(n_clusters=20)

topic_evaded_summarised_24Q2Q3 = BERTopic(umap_model=dim_model, embedding_model=embedding_model,
                       hdbscan_model=cluster_model, calculate_probabilities=True)
topics, probs = topic_evaded_summarised_24Q2Q3.fit_transform(df_evaded_summarised_24Q2Q3_list)
topic_evaded_summarised_24Q2Q3 = topic_evaded_summarised_24Q2Q3.reduce_topics(df_evaded_summarised_24Q2Q3_list, nr_topics=7)
InΒ [Β ]:
# Write each BERTopic visualisation to an HTML file
topic_evaded_summarised_24Q2Q3.visualize_topics().write_html("topic_evaded_summarised_24Q2Q3_topic.html")
topic_evaded_summarised_24Q2Q3.visualize_barchart(top_n_topics=20, n_words=8, autoscale=True).write_html("topic_evaded_summarised_24Q2Q3_barchart.html")
topic_evaded_summarised_24Q2Q3.visualize_heatmap().write_html("topic_evaded_summarised_24Q2Q3_heatmap.html")
topic_evaded_summarised_24Q2Q3.visualize_hierarchy().write_html("topic_evaded_summarised_24Q2Q3_hierarchy.html")

# Display each saved plot in the notebook
#display(HTML("topic_evaded_summarised_24Q2Q3_topic.html"))
display(HTML("topic_evaded_summarised_24Q2Q3_barchart.html"))
display(HTML("topic_evaded_summarised_24Q2Q3_heatmap.html"))
display(HTML("topic_evaded_summarised_24Q2Q3_hierarchy.html"))


The topics associated with evaded questions were capital markets, growth dynamics (including NII), and financial indicators.

3.2.4 Evasion + NegativityΒΆ

To look at the overlapping topics between the evasive and negative-sentiment texts, we create a Venn diagram.
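`extract_keywords` is presumably defined earlier in the notebook; the sketch below shows the kind of logic assumed here, collecting each topic's top words into one set, using a plain dict standing in for BERTopic's `model.get_topics()` output (the name `extract_keywords_sketch` and the stub values are illustrative):

```python
def extract_keywords_sketch(topics_dict, top_n=8):
    """Collect the top_n words from every topic into one set (topic -1 is the outlier topic)."""
    keywords = set()
    for topic_id, word_scores in topics_dict.items():
        if topic_id == -1:  # skip BERTopic's outlier topic
            continue
        keywords.update(word for word, _ in word_scores[:top_n])
    return keywords

# Stub mirroring the (word, score) structure of BERTopic's model.get_topics()
stub_topics = {
    -1: [('misc', 0.01)],
    0: [('capital', 0.9), ('nii', 0.8)],
    1: [('growth', 0.7), ('nii', 0.6)],
}
kws = extract_keywords_sketch(stub_topics)
```

With one keyword set per model, the Venn overlap below is just a set intersection.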

InΒ [Β ]:
# Create a Venn diagram
keywords_evaded_24Q2Q3 = extract_keywords(topic_evaded_summarised_24Q2Q3)
keywords_phi_24Q2Q3neg = extract_keywords(topic_model_phi_2024Q2Q3_neg)
venn_24 = venn2(
    [keywords_evaded_24Q2Q3, keywords_phi_24Q2Q3neg],
    ('Evaded', 'Negative')
)
# Set colors
venn_24.get_patch_by_id('10').set_color('#FFBF00')  # Evaded only
venn_24.get_patch_by_id('01').set_color('#00BFFF')    # Negative only
# Add a title
plt.title("BERTopic Word Frequencies")
plt.show()
InΒ [Β ]:
# Inspect overlaps
intersection_evaded_phi_24Q2Q3neg = keywords_evaded_24Q2Q3 & keywords_phi_24Q2Q3neg
print("Overlap between Evaded and Phi Negative:", intersection_evaded_phi_24Q2Q3neg)
Overlap between Evaded and Phi Negative: {'income', 'rate', 'market', 'modest', 'performance', 'deposit', 'growth', 'potential', 'closer', 'curve', 'ongoing', 'change', 'relation', 'nii'}

These are the most common words in the topics discovered by our BERTopic models. They correspond to the most frequent themes across both the negative-sentiment and the evaded Q&As.

3.2.5 Regulatory keyword outputsΒΆ

InΒ [Β ]:
# Function to create a wordcloud (basic)
def create_wordcloud_basic(title, topics):
  wordcloud = WordCloud(width=1000, height=600, background_color='white').generate(" ".join(topics))
  plt.figure(figsize=(10, 6))
  plt.imshow(wordcloud, interpolation='bilinear')
  plt.title(f'{title}')
  plt.axis('off')
  plt.show()
InΒ [Β ]:
# Filter the quarters of interest df to be where 'keywords' column is not null or empty
filtered_df_keywords = quarter_df_results[quarter_df_results['keywords'].notna() & (quarter_df_results['keywords'] != '')]
InΒ [Β ]:
"""
  Look at all text in answers where Basel III mentioned.
  Create a wordcloud for easy visualisation of common words in these answers.
"""

# Extract and clean 'qa_text' by words
qa_topics = [
    topic.strip()
    for row in filtered_df_keywords['qa_text'].dropna()
    for topic in row.split(' ')
]

# Remove uninformative words
words_to_remove = ['Jamie', 'obviously', 'thing', 'whatever', 'right', 'know', 'look', 'think', 'question', 'actually', 'really', 'still', 'III', 'yeah']
qa_topics = [topic for topic in qa_topics if topic.lower() not in map(str.lower, words_to_remove)]

# Plot wordcloud of qa_text where Basel III mentioned
create_wordcloud_basic('WordCloud of Q&As where Basel III mentioned', qa_topics)
InΒ [Β ]:
"""
  Create filtered dataframe containing only entries where 'keywords' is not empty.
  Analyse the Evaded topics for these entries.
  Note that all these answers were classed as Evasive.
"""
# Extract and clean the 'Phi-3.5 Evaded Topics
evaded_topics = [
    topic.strip()
    for row in filtered_df_keywords['Phi-3.5 Evaded Topics'].dropna()
    for topic in row.split(',')
]

# Remove uninformative words (e.g., Basel III Endgame) from wordcloud (as the title includes Basel III)
words_to_remove = ['Basel', 'III', 'Endgame', 'Basel III Endgame']
evaded_wordcloud_topics = [topic for topic in evaded_topics if topic.lower() not in map(str.lower, words_to_remove)]

# Plot a wordcloud for evaded topics where Basel III mentioned
create_wordcloud_basic('Evaded Topics where Basel III mentioned', evaded_wordcloud_topics)

# Display list of topics
topic_series = pd.Series(evaded_topics)
topic_counts = topic_series.value_counts()
print("\nEvaded Topics where Basel III mentioned:")
display(topic_counts)
Evaded Topics where Basel III mentioned:
count
Basel III Endgame 4
NII normalization 1
digital banking 1
Impact on Lending 1
Proposal Details 1
Basel III 1
Capital Requirements 1
Capital Return and Buyback Trajectory 1
GSIB Surcharge Calculations 1
Capital Scenarios 1
ROTCE 1
competitive market 1
Yield curve effects 1
share growth 1
17% capital 1
GSIB recalibration 1
SCB and CCAR 1
CET1 ratio 1
RWA 1
Economic environment 1
Overhead ratio 1
Cost-Benefit Analysis 1

Evasive topics focus on capital requirements and returns, GSIB, NII, yield, and the CET1 ratio.
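The `keywords` column filtered above is assumed to hold regulatory terms detected in each answer; a minimal sketch of how such a column could be populated (the watchlist and column names are illustrative, not the notebook's actual ones):

```python
import re
import pandas as pd

REG_TERMS = ['Basel III', 'CET1', 'GSIB', 'CCAR', 'RWA']  # illustrative watchlist
pattern = re.compile('|'.join(re.escape(t) for t in REG_TERMS), flags=re.IGNORECASE)

def detect_keywords(text):
    """Return detected regulatory terms as a comma-separated string ('' if none)."""
    hits = {m.group(0) for m in pattern.finditer(text)}
    return ', '.join(sorted(hits))

toy = pd.DataFrame({'qa_text': [
    'Basel III endgame will lift our CET1 ratio.',
    'Loan growth was strong this quarter.',
]})
toy['keywords'] = toy['qa_text'].apply(detect_keywords)
```

Rows with an empty string would then be excluded by the `notna`/non-empty filter used earlier in this section.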

4 ConclusionsΒΆ

To derive meaningful insight from transcripts, we recommend a solution pipeline that involves identification of texts with specific sentiment or question evasion status, followed by topic modelling using BERTopic on Phi-3.5-summarised text. By focusing on negative-sentiment Q&A in 1Q22, BERTopic highlighted themes like declining Net Interest Income in 55% of texts and geopolitical risks in 27% of texts. In evasive answers, the most common topic (60% of texts) concerned corporate growth, but geopolitical challenges were present too (9%). These findings enhanced our understanding of sentiment and thematic concerns during critical periods.

Emerging risks were identified through analysis of the two most recent quarters, where we observed a decline in the average sentiment, driven by greater neutrality at the expense of positive sentiment. Overlapping BERTopic-discovered themes in negative and evasive Q&As provided valuable insights into areas such as capital market uncertainty and reserve dynamics that might require closer regulatory attention. Detection of keywords related to regulatory changes like Basel III, capital adequacy, and credit loss provisions contextualised these findings, providing actionable insights into emerging risks.

The methodology shows potential for scalability with tailored preprocessing for different transcript formats. Automating transcript preprocessing would enhance adaptability. Generalised models, such as Phi-3.5 and BERTopic, already exhibit strong applicability across financial datasets, but fine-tuning of Phi-3.5 with domain-specific training datasets is likely to further improve performance.